
Search Engine Spider and User Agent Identification Forum

This 152 message thread spans 6 pages; this is page 3.
MSN's many cloaked bots. Again.
Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 11:44 pm on Aug 5, 2010 (gmt 0)

Previously... [webmasterworld.com]

Currently, straight out of my logs...

65.52.33.73 - - [05/Aug/2010:15:45:09 -0700] "GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"

No UA, no robots.txt, no REF, no nothing. Not once. Not twice. Not even three times. Try eleven.

65.52.33.73
-
08/05 15:45:09 /dir/filename.html
08/05 15:45:20 /dir/filename.html
08/05 15:45:31 /dir/filename.html
08/05 15:45:42 /dir/filename.html
08/05 15:45:53 /dir/filename.html
08/05 15:46:03 /dir/filename.html
08/05 15:46:14 /dir/filename.html
08/05 15:46:25 /dir/filename.html
08/05 15:46:35 /dir/filename.html
08/05 15:46:46 /dir/filename.html
08/05 15:46:57 /dir/filename.html

Same poor file. All hits 403'd because no UA; also because bare MSN IP and not a bona fide MSN bot.
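For anyone who wants to pull these out of their own logs, here is a minimal Python sketch that parses a line in the Apache "combined" format (the format of the sample line above) and flags a missing UA. The regex and the helper names `parse_line` and `has_no_user_agent` are illustrative assumptions, not anything a poster in this thread is known to run.

```python
import re

# Apache "combined" log format, matching the sample line quoted above.
LOG_RE = re.compile(
    r'(?P<host>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<request>[^"]*)" (?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<ua>[^"]*)"'
)

def parse_line(line):
    """Return a dict of named fields, or None if the line does not match."""
    match = LOG_RE.match(line)
    return match.groupdict() if match else None

def has_no_user_agent(entry):
    """True when the UA field is empty or just the '-' placeholder."""
    return entry["ua"].strip() in ("", "-")

sample = ('65.52.33.73 - - [05/Aug/2010:15:45:09 -0700] '
          '"GET /dir/filename.html HTTP/1.1" 403 1468 "-" "-"')
entry = parse_line(sample)
print(entry["host"], entry["status"], has_no_user_agent(entry))
```

The same two helpers can be run over a whole log file to list every host that arrived without a UA.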

 

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 1:27 am on Oct 27, 2010 (gmt 0)

You don't need to disallow all non-existent URLs, only the ones you are getting hits on and don't want to get hits on anymore.

jdMorgan

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member



 
Msg#: 4182830 posted 1:52 am on Oct 27, 2010 (gmt 0)

I for one am not yet sure that this is a real msnbot actually controlled by MSN. The "._" at the end of the UA string and the "different" value in the "From" header push me toward "no."

Everyone posting to this thread, please be explicit as to the full UA-string, the "From" header (if you log it), the IP address, and the rDNS if you check it.

Thanks,
Jim

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 2:28 am on Oct 27, 2010 (gmt 0)

In fact two of the topics should be in other threads, as this one started about MSN cloaked bots.
To discuss the MSN bot with the "._" at the end use [webmasterworld.com...]
And maybe start a new thread on msnbot requesting /logs or /access_logs

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 5:38 am on Oct 27, 2010 (gmt 0)

@Dijkgraaf,

1.) This thread's OP is my observing/complaining about hits from 'bare MSN IPs and not bona fide MSN bots'. Sure, threads overlap but I think subsequent posts are on-point with my OP because they're describing hits from, well, 'bare MSN IPs and not bona fide MSN bots' -- and that includes the iffy bot with the "._"

2.) You said: "add /logs and /access_log to robots.txt and you won't get any more hits on those" and similarly: "You don't need to disallow all non-existent URLs, only the ones you are getting hits on and don't want to get hits on anymore."

I wish. But as thousands of posts in this forum attest, simply including dirs and files in robots.txt is definitely not a sure-fire way to eliminate hits to same, neither by bad bots nor, increasingly, 'good' ones.

Bottom line for me --

It's not okay for MSN IPs to do anything other than what I say is okay via robots.txt and .htaccess (via mod_rewrite). The former file states what's allowed (or not). The latter makes sure what's disallowed stays that way. :)

Staffa

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4182830 posted 5:47 pm on Oct 27, 2010 (gmt 0)

It's not okay for MSN IPs to do anything other than what I say is okay via robots.txt and .htaccess (via mod_rewrite). The former file states what's allowed (or not). The latter makes sure what's disallowed stays that way.

Amen

and that goes for other Bots as well ;o)

Dijkgraaf

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 10:46 pm on Oct 28, 2010 (gmt 0)

@Pfui

But that's the point. You have not told MSN or other bots that they are not allowed to ask for those URLs. With the current standard, anything that isn't explicitly disallowed is allowed. I've seen some very odd requests sometimes, even from GoogleBot, including asking for some .exe files. I've put those down to either someone creating fake inbound links to get Google to scan for vulnerabilities or GoogleBot checking for compromised sites.

Yes, there are bots that won't obey those rules, but those ones you will want to ban outright anyway.

I monitor the 404s occurring on my site, and if a bot starts re-visiting them I disallow it in my robots.txt file or give it a 301 to the resource they should be requesting. Either way it solves the problem, which isn't a big one to start off with, as all that would be happening is that they are getting a 404.
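To illustrate the "anything not explicitly disallowed is allowed" point, here is a small sketch using Python's standard `urllib.robotparser`. The robots.txt content is an assumed example built from the /logs and /access_log paths mentioned in this thread, and `/some-missing-page.html` is a made-up path. It only shows what a compliant bot would conclude; it says nothing about what a misbehaving bot will actually do.

```python
from urllib.robotparser import RobotFileParser

# Assumed example robots.txt; only the two paths discussed are disallowed.
ROBOTS_TXT = """\
User-agent: *
Disallow: /logs
Disallow: /access_log
"""

def build_parser(text):
    """Parse robots.txt text without fetching anything over the network."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser

rules = build_parser(ROBOTS_TXT)
for path in ("/logs", "/access_log", "/some-missing-page.html"):
    # can_fetch() returns True unless a matching Disallow rule exists.
    print(path, rules.can_fetch("msnbot", path))
```

A non-existent URL that is not listed stays "allowed" as far as the standard is concerned, which is why a compliant bot may keep requesting it and collecting 404s.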

Samizdata

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 12:08 am on Oct 29, 2010 (gmt 0)

all that would be happening is that they are getting a 404

No, the bot also takes the robots.txt file.

If it is a genuine Microsoft bot - and I take Jim's point that the jury is still out on that - then phishing for access logs is disgraceful behaviour for a reputable company.

If it is a fake Microsoft bot using a genuine Microsoft IP then I certainly don't want it reading the robots.txt file or getting any information whatsoever.

Either way, I'd say a 403 is what it deserves.

...

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 5:21 am on Oct 29, 2010 (gmt 0)

Jim, about the iffy "._" bot -- see also my observations in this thread's predecessor, "MSN's many cloaked bots." (2009): [webmasterworld.com...] and yours in "Wanted: Crawler Quality Assurance Engineer" (2010): [webmasterworld.com...]

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 2:18 am on Nov 1, 2010 (gmt 0)

Doggoneit. MSN just ran bingbot from a bare IP:

157.55.16.229
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

robots.txt? NO

keyplyr

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month



 
Msg#: 4182830 posted 3:01 am on Nov 1, 2010 (gmt 0)



MSN just ran bingbot from a bare IP


Thanks for the heads-up Pfui. Guess I was asleep at the wheel and blocked this one.

157.55.16.231 - - [30/Oct/2010:01:50:32 -0700] "GET www.example.com HTTP/1.1" 403 479 "-" "Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)"

Mokita

5+ Year Member



 
Msg#: 4182830 posted 11:04 pm on Nov 5, 2010 (gmt 0)

Just seen coming from rdns msnbot-65-52-49-143.search.msn.com (65.52.49.143), requests for one page plus css, but no images.

UA was exactly: Mozilla/4.0 (compatible

(no closing bracket)

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 9:27 am on Nov 10, 2010 (gmt 0)

Akin to the OP, and also msg #:4203105, yet another hit-and-run with no UA, no robots.txt, no REF, no nothing. To yet another file. Eleven times. For the third time!

65.52.32.17
-
11/10 00:15:45 /dir/filename.html
11/10 00:15:56 /dir/filename.html
11/10 00:16:07 /dir/filename.html
11/10 00:16:17 /dir/filename.html
11/10 00:16:28 /dir/filename.html
11/10 00:16:39 /dir/filename.html
11/10 00:16:50 /dir/filename.html
11/10 00:17:00 /dir/filename.html
11/10 00:17:11 /dir/filename.html
11/10 00:17:22 /dir/filename.html
11/10 00:17:32 /dir/filename.html

Too weird.
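The regularity is easier to see as gaps between hits. A minimal Python sketch, using only the timestamps listed above; the helper name `gaps_in_seconds` is made up for illustration.

```python
from datetime import datetime

def gaps_in_seconds(stamps, fmt="%H:%M:%S"):
    """Return the number of seconds between each consecutive pair of hits."""
    times = [datetime.strptime(s, fmt) for s in stamps]
    return [int((b - a).total_seconds()) for a, b in zip(times, times[1:])]

# Times copied from the 65.52.32.17 listing above.
stamps = [
    "00:15:45", "00:15:56", "00:16:07", "00:16:17", "00:16:28", "00:16:39",
    "00:16:50", "00:17:00", "00:17:11", "00:17:22", "00:17:32",
]
gaps = gaps_in_seconds(stamps)
print(len(stamps), "hits; gaps:", gaps)
```

Every gap comes out at 10 or 11 seconds, which looks like a fixed retry timer rather than a person or a normal crawl schedule, though that is only an inference from the spacing.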

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 3:32 am on Nov 11, 2010 (gmt 0)

Same ol', same ol', the dreaded 65.52., albeit a different Host and UA than previously mentioned in this thread:

msnbot-65-52-50-54.search.msn.com
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 5.1; Trident/4.0; SV1; .NET CLR 1.1.4325; .NET CLR 2.0.50727; .NET CLR 3.0.30729; .NET CLR 3.5.30707)

robots.txt? NO

Just went for root.

Staffa

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 4182830 posted 11:19 pm on Nov 12, 2010 (gmt 0)

I noticed a weird event today.
A regular visitor (via G.fr search) comes and views a few pages.

Next comes 207.46.204.nn (msnbot IP) with exactly the same UA as the previous visitor which was what caught my attention.

UA for both :
Mozilla/4.0 (compatible; MSIE 8.0; Windows NT 6.1; Win64; x64; Trident/4.0; GTB6.6; .NET CLR 2.0.50727; SLCC2; .NET CLR 3.5.30729; .NET CLR 3.0.30729; HPDTDF; Tablet PC 2.0; .NET4.0C; Creative AutoUpdate v1.40.01)

My log files are not public and it is either a mighty coincidence or ... a case where msn is following the visitor from France ?

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 6:36 am on Nov 19, 2010 (gmt 0)

Fresh from one of my logs, emphasis mine:

msnbot-65-52-49-143.search.msn.com - - [18/Nov/2010:22:24:15 -0800] "GET / HTTP/1.1" 403 1468 "-" "Mozilla/4.0 (compatible"

Crawlus interruptus?

Mokita

5+ Year Member



 
Msg#: 4182830 posted 10:22 am on Nov 19, 2010 (gmt 0)

Pfui:

Exactly the same details as my post above on Nov 6, (including the rDNS).

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 7:46 pm on Nov 19, 2010 (gmt 0)

(slaps head) Thank you for pointing that out:) Curious how it's even the same IP. And BvsB shows the exact same oddity, along with six out of seven UAs not identified as msnbot- or bing-related from 65.52.49.143: [botsvsbrowsers.com...]

Unfortunately that IP's not the only one using the truncated UA. Here are 22 more: [botsvsbrowsers.com...]

Clearly they know what they're doing. But until I know, non-search UAs from .search.msn.com Hosts/IPs will just keep getting 403s.
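That rule can be sketched in a few lines of Python. This is only an illustration of the policy as described, not the actual mod_rewrite setup: the token list, the function names, and the assumption that `host` is the reverse-DNS name as logged are all mine.

```python
SEARCH_HOST_SUFFIX = ".search.msn.com"
# Tokens treated as a "declared crawler" UA; this list is an assumption.
CRAWLER_TOKENS = ("msnbot", "bingbot")

def is_truncated_ua(ua):
    """Unbalanced parentheses, as in 'Mozilla/4.0 (compatible'."""
    return ua.count("(") != ua.count(")")

def is_declared_crawler(ua):
    """True when the UA string names one of the crawler tokens."""
    lowered = ua.lower()
    return any(token in lowered for token in CRAWLER_TOKENS)

def status_for(host, ua):
    """Return 403 for a non-search UA from a search host, else 200."""
    from_search_host = host.lower().endswith(SEARCH_HOST_SUFFIX)
    if from_search_host and not is_declared_crawler(ua):
        return 403
    return 200

print(status_for("msnbot-65-52-49-143.search.msn.com",
                 "Mozilla/4.0 (compatible"))
print(is_truncated_ua("Mozilla/4.0 (compatible"))
```

Note that the "._" variant still contains "msnbot", so this sketch would let it through; anyone who wants to block it would need an extra check on the tail of the UA string.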

Transparency. What a concept. (Note the date...) [bing.com:80...]

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 9:54 am on Nov 20, 2010 (gmt 0)

More bare IPs from 65.52. Hit in a post-tweet swarm, 20 minutes apart. All using:

Mozilla/4.0 (compatible; MSIE 7.0; Windows NT 6.0)

65.52.4.249
65.52.2.10
65.52.17.79

robots.txt? NO

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 2:24 am on Nov 28, 2010 (gmt 0)

Ten same-second hits to the same site. So much for MSN stating msnbot supports "Crawl-delay" in robots.txt: Crawl Delay And The Bing Crawler, MSNBot [bing.com]

msnbot-65-55-190-9.search.msn.com
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
11/27 03:45:34 /robots.txt
11/27 03:45:34 /file1

65.55.190.19
msnbot-NewsBlogs/2.0b (+http://search.msn.com/msnbot.htm)
11/27 03:45:34 /robots.txt
11/27 03:45:34 /robots.txt
11/27 03:45:34 /file2
11/27 03:45:34 /file3
11/27 03:45:34 /robots.txt
11/27 03:45:34 /robots.txt
11/27 03:45:34 /file4
11/27 03:45:34 /file5
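Here is a rough Python sketch for counting Crawl-delay violations in a listing like the one above. The robots.txt content, the 10-second value, and the year 2010 passed to the timestamp parser are assumptions for illustration (the real Crawl-delay value is not given in this post). Hits from both hosts are pooled because they share one UA, and robots.txt fetches are not counted as page hits.

```python
from datetime import datetime
from urllib.robotparser import RobotFileParser

# Assumed robots.txt; the actual Crawl-delay value is not stated in the post.
ROBOTS_TXT = """\
User-agent: msnbot
Crawl-delay: 10
"""

def crawl_delay_for(robots_text, user_agent):
    """Return the Crawl-delay that applies to user_agent, or None."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.crawl_delay(user_agent)

def count_violations(hits, delay_seconds, year=2010):
    """Count page hits arriving sooner than delay_seconds after the previous one.

    hits is a list of ('MM/DD HH:MM:SS', path); robots.txt fetches are skipped.
    """
    times = sorted(
        datetime.strptime(f"{year} {stamp}", "%Y %m/%d %H:%M:%S")
        for stamp, path in hits
        if path != "/robots.txt"
    )
    return sum(
        1 for earlier, later in zip(times, times[1:])
        if (later - earlier).total_seconds() < delay_seconds
    )

# The ten hits listed above, both hosts pooled.
hits = [
    ("11/27 03:45:34", "/robots.txt"), ("11/27 03:45:34", "/file1"),
    ("11/27 03:45:34", "/robots.txt"), ("11/27 03:45:34", "/robots.txt"),
    ("11/27 03:45:34", "/file2"), ("11/27 03:45:34", "/file3"),
    ("11/27 03:45:34", "/robots.txt"), ("11/27 03:45:34", "/robots.txt"),
    ("11/27 03:45:34", "/file4"), ("11/27 03:45:34", "/file5"),
]
delay = crawl_delay_for(ROBOTS_TXT, "msnbot-NewsBlogs/2.0b")
print(delay, count_violations(hits, delay))
```

With any non-zero delay, four of the five page fetches arrive too soon after the previous one, since all five share the same second.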

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 5:40 am on Dec 16, 2010 (gmt 0)

Same Hostname caribguy reported above. At least asked for robots.txt and heeded same. But crawled 20 documents whereas bing/msnbot usually crawls 1 or 2 per server session. Also, the same weird UA was in play:

gig4-2.tuk2f-gsr-a.us.msn.net
msnbot/2.0b (+http://search.msn.com/msnbot.htm)._

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 4182830 posted 9:21 pm on Dec 16, 2010 (gmt 0)

I block all those UAs. Apart from anything else they seem to come from non-rDNS IPs.

Pfui

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 9:09 am on Jan 17, 2011 (gmt 0)

The odd one came around on the 16th --

msnbot-65-55-55-205.search.msn.com
msnbot/2.0b (+http://search.msn.com/msnbot.htm)._

robots.txt? Yes but... "Crawl-delay" directive ignored.

-- and alternated with the usual suspects in a 15-minute period with hits from Hosts and bare IPs. Here's a partial listing:

msnbot-65-52-110-69.search.msn.com
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

msnbot-207-46-194-144.search.msn.com
msnbot-media/1.1 (+http://search.msn.com/msnbot.htm)

157.55.16.230
Mozilla/5.0 (compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm)

FWIW

incrediBILL

WebmasterWorld Administrator, WebmasterWorld Top Contributor of All Time, 5+ Year Member, Top Contributors Of The Month



 
Msg#: 4182830 posted 10:19 pm on Jan 17, 2011 (gmt 0)

I'm guessing that bingbot from bare IPs is either:

- running from a test lab
- run via some MS cloud services
- being spoofed via some MS proxy

Doesn't matter, if the IP doesn't resolve to a crawler rDNS, I block it.
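A minimal Python sketch of that kind of check, with a forward-confirmation step added (reverse lookup, then confirm the hostname resolves back to the same IP). The `.search.msn.com` suffix comes from the hostnames in this thread; the function names and the injectable resolver arguments are assumptions made so the logic can be tried without touching the network. This is one way to do it, not necessarily how anyone here does it.

```python
import socket

CRAWLER_SUFFIX = ".search.msn.com"

def reverse_lookup(ip):
    """PTR lookup; returns None when the IP has no rDNS."""
    try:
        return socket.gethostbyaddr(ip)[0]
    except (socket.herror, socket.gaierror):
        return None

def forward_lookup(host):
    """A-record lookup; returns a list of IPv4 addresses (possibly empty)."""
    try:
        return socket.gethostbyname_ex(host)[2]
    except (socket.herror, socket.gaierror):
        return []

def is_verified_crawler(ip, reverse=reverse_lookup, forward=forward_lookup):
    """True only if rDNS ends in the crawler suffix AND resolves back to ip."""
    host = reverse(ip)
    if not host or not host.lower().endswith(CRAWLER_SUFFIX):
        return False
    return ip in forward(host)

# Offline demo with stub resolvers; the mappings below are illustrative only.
fake_ptr = {"65.52.110.69": "msnbot-65-52-110-69.search.msn.com"}
fake_a = {"msnbot-65-52-110-69.search.msn.com": ["65.52.110.69"]}

def stub_forward(host):
    return fake_a.get(host, [])

print(is_verified_crawler("65.52.110.69", fake_ptr.get, stub_forward))
print(is_verified_crawler("157.55.16.230", fake_ptr.get, stub_forward))
```

Calling `is_verified_crawler(ip)` with no extra arguments uses live DNS. The forward step matters because anyone who controls the reverse zone for their own IP can make its PTR record claim any hostname.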

FYI - these bare IPs are showing up on Project Honeypot: [projecthoneypot.org...]

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 4182830 posted 9:45 pm on Feb 7, 2011 (gmt 0)

Microsoft seems to have legitimized a few more bot IPs by setting up proper rDNS for them. The range below may not be ALL of the new ones but seems to be correct for 157.55.116 (although the bot is still coming round with the "invalid" underscore UA and ignoring robots.txt on this and other ranges).

157.55.116.7 - 157.55.116.97

Bots still hitting hard but with no valid rDNS are in the range:

157.55.16.0 - 157.55.18.255
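For anyone who wants to test logged IPs against these two ranges, or convert the second one to CIDR for a firewall or .htaccess rule, here is a small sketch with Python's standard `ipaddress` module. The constant and function names are made up; the range endpoints are the ones given above.

```python
import ipaddress

# Range endpoints as reported above.
RDNS_OK = ("157.55.116.7", "157.55.116.97")
NO_RDNS = ("157.55.16.0", "157.55.18.255")

def in_range(ip, first, last):
    """True when ip falls within the inclusive range first..last."""
    addr = ipaddress.ip_address(ip)
    return ipaddress.ip_address(first) <= addr <= ipaddress.ip_address(last)

def cidr_blocks(first, last):
    """Express an inclusive IP range as the smallest list of CIDR blocks."""
    networks = ipaddress.summarize_address_range(
        ipaddress.ip_address(first), ipaddress.ip_address(last)
    )
    return [str(network) for network in networks]

print(in_range("157.55.16.229", *NO_RDNS))
print(cidr_blocks(*NO_RDNS))
```

The bare bingbot IPs reported earlier in the thread (157.55.16.229 to .231) fall inside the no-rDNS range.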

AlexK

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 4:36 am on Jun 24, 2011 (gmt 0)

4 days ago the msnbot started hitting my site at up to 12 hits / second from multiple IPs; thousands & thousands of attempted scrapes. I've had routines in place for years to auto-stop bad-bots. Those routines have been augmented to record & report them (both to the tornvall RBL & also on a public webpage as a permanent record). Here's the reports; so far, the same behaviour for 4 days on the trot:

24 June:
65.52.110.75 [forums.modem-help.co.uk] : msnbot-65.52.110.75.search.msn.com : max: 9 / sec : total: 26 pages
207.46.204.238 [forums.modem-help.co.uk] : msnbot-207.46.204.238.search.msn.com : max: 11 / sec : total: 2,214 pages
207.46.199.43 [forums.modem-help.co.uk] : msnbot-207.46.199.43.search.msn.com : max: 8 / sec : total: 548 pages

23 June:
207.46.13.98 [forums.modem-help.co.uk] : msnbot-207.46.13.98.search.msn.com : max: 12 / sec : total: 3,272 pages

22 June:
207.46.13.98 [forums.modem-help.co.uk] : msnbot-207.46.13.98.search.msn.com : max: 12 / sec : total: 1,395 pages
65.52.110.64 [forums.modem-help.co.uk] : msnbot-65.52.110.64.search.msn.com : max: 7 / sec : total: 25 pages
65.52.110.72 [forums.modem-help.co.uk] : msnbot-65.52.110.72.search.msn.com : max: 12 / sec : total: 1,888 pages
207.46.195.242 [forums.modem-help.co.uk] : msnbot-207.46.195.242.search.msn.com : max: 12 / sec : total: 1,201 pages

21 June:
65-52-110-72 [forums.modem-help.co.uk] : msnbot-65-52-110-72.search.msn.com : max: 12 / sec : total: 1,424 pages
207-46-195-242 [forums.modem-help.co.uk] : msnbot-207-46-195-242.search.msn.com : max: 12 / sec : total: 926 pages
207-46-199-38 [forums.modem-help.co.uk] : msnbot-207-46-199-38.search.msn.com : max: 8 / sec : total: 26 pages

A full report has also been auto-emailed to the MS abuse address each day for each IP. Frankly, I do not expect any action.
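As a rough idea of how "max / sec" and "total" figures like those can be derived, here is a Python sketch that tallies hits per IP per second. It is not the actual Bad-Bot-Blocker code; the function names, the threshold of 5, and the sample data (documentation-range IPs and invented timestamps) are all made up for illustration.

```python
from collections import Counter, defaultdict

def rate_report(hits):
    """hits: iterable of (ip, epoch_second). Returns {ip: (max_per_sec, total)}."""
    per_second = defaultdict(Counter)
    for ip, second in hits:
        per_second[ip][int(second)] += 1
    return {
        ip: (max(counts.values()), sum(counts.values()))
        for ip, counts in per_second.items()
    }

def over_limit(report, limit):
    """Return the IPs whose peak per-second rate exceeds limit, sorted."""
    return sorted(ip for ip, (peak, _total) in report.items() if peak > limit)

# Invented sample data: one noisy client and one quiet one.
hits = (
    [("192.0.2.1", 100)] * 12
    + [("192.0.2.1", 101)] * 3
    + [("192.0.2.2", 100), ("192.0.2.2", 105)]
)
report = rate_report(hits)
print(report)
print(over_limit(report, 5))
```

A real blocker would need a sliding window and expiry of old entries; this only shows the counting.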

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 4182830 posted 9:02 pm on Jun 24, 2011 (gmt 0)

Had a lot of hits from msnbot recently - coming round on some sites once a day - but behaviour seems reasonable.

That's specific known msnbot UA plus IP, of course. There are always a few non-bot UAs or badly formatted ones, plus the odd bad rDNS IP (your IP examples are good).

AlexK

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 11:16 pm on Jun 24, 2011 (gmt 0)

@dstiles

It's a Microsoft ASN for each IP.

I've seen reports in the past for such behaviour from the msnbot, back in the day before the bingbot, but this is the first time that I've picked it up personally. The stop-block-report routines do not pick up the UA, only info (such as port number) that the hostmaster requires to confirm the abuse. Not that I expect--or have received--anything more than an acknowledgement of receipt from MS (abuse@hotmail.com).

dstiles

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member



 
Msg#: 4182830 posted 9:27 pm on Jun 25, 2011 (gmt 0)

Not sure what you mean by stop-block-report. Is this something on your machine that reports problems?

I get my info from the site logs (I also record abuses in a separate log): that shows UA and IP. If I suspect a problem I check the rDNS using the linux Network Tools or online robtex.

I've never bothered to report msnbot/bingbot abuse.

AlexK

WebmasterWorld Senior Member 5+ Year Member



 
Msg#: 4182830 posted 6:36 am on Jun 26, 2011 (gmt 0)

dstiles:
Not sure what you mean by stop-block-report. Is this something on your machine that reports problems?

The original Bad-Bot-Blocker [forums.modem-help.co.uk], with added vitamins! Originated on WebmasterWorld [webmasterworld.com] & developed by myself. The bot-blocker is the 'stop-block' part. I've added the capability to record the abuse, auto-report (both to abuse email + to RBL site) & provide permanent evidence. I'm a bit proud of it.

The constant scrapes have now almost entirely gone, although that could be because each IP now gets an auto-block. It still begs the question: what on earth are MS doing?

tangor

WebmasterWorld Senior Member, WebmasterWorld Top Contributor of All Time, 5+ Year Member, Top Contributors Of The Month



 
Msg#: 4182830 posted 6:59 am on Jun 26, 2011 (gmt 0)

4 days ago the msnbot started hitting my site at up to 12 hits / second from multiple IPs; thousands & thousands of attempted scrapes. I've had routines in place for years to auto-stop bad-bots.


Is Bing a "bad bot"? Just asking. Google is proving short on supply and Bing on the rise... which is the "bad bot"?

Just curious.

Mokita

5+ Year Member



 
Msg#: 4182830 posted 8:40 am on Jun 26, 2011 (gmt 0)

@AlexK

Bingbot/MsnBot claim to honour a "Crawl-delay" directive in robots.txt :

[bing.com...]

Did you try that method before/instead of using the sledge-hammer approach?

If you had, your complaints to MS Abuse would probably carry more weight.

... Just a suggestion (as I haven't experienced the abuse you mention but I do have a "Crawl-delay" setting)

<edit> BingBot is crawling our sites heavily - but honouring the Crawl-delay </edit>

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved