Mozilla/4.0; (Compitable; MSIE 6.0; Windows NT 5.1;)

MS is doing sneaky stuff

         

GaryK

8:52 pm on Oct 15, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Mozilla/4.0; (Compitable; MSIE 6.0; Windows NT 5.1;)
207.46.89.14
tide121.microsoft.com

Poor Microsoft. They can't even spell compatible or format a ua properly. :)

Looks like I owe you an apology, Don. The week after I told you MS has never pulled any sneaky stuff with me, they went and did just that!

No robots.txt request. It looks like they're going after default pages only. They hit me twice, two days apart. Sure wish I could ban them by IP, but they share a similar range with MSN's legitimate bots. I can probably ban at the Class C level. Would this be a dumb move on my part?
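For a sense of what a Class C ban would cover, here is a minimal Python sketch. It assumes "Class C" means the /24 block around the address logged above; the second address tested is an msnbot-media address that shows up in a log line later in this thread. The function and variable names are illustrative only.

```python
import ipaddress

# Assumption: "Class C" here means the /24 network containing the logged address.
class_c = ipaddress.ip_network("207.46.89.14/24", strict=False)

def is_banned(ip: str) -> bool:
    """Return True if ip falls inside the /24 that would be denied."""
    return ipaddress.ip_address(ip) in class_c

print(class_c)                     # 207.46.89.0/24
print(class_c.num_addresses)       # 256
print(is_banned("207.46.89.14"))   # True  (the "Compitable" visitor)
print(is_banned("65.55.212.184"))  # False (an msnbot-media address from this thread)
```

A /24 covers 256 addresses, so any legitimate MSN crawler that happens to live in the same block would be denied as well, which is the trade-off being asked about.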

wilderness

12:43 am on Oct 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



They can't even spell compatible or format a ua properly

Hey Gary,
I wonder if perhaps the word is misspelled on purpose to circumvent browser ID?

Don

wilderness

12:48 am on Oct 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



BTW, and while we have MSN in mind:

Is anybody aware of which UA is their main crawler?

I grew weary of the MSN bot variations and began by denying the newest in the 74 Class.

Then I added all the following to my robots.txt a few days later:

User-agent: MSNPTC
Disallow: /

User-agent: msnbot-MM
Disallow: /

User-agent: msnbot-products
Disallow: /

User-agent: MSRBOT
Disallow: /

User-agent: msnbot-media
Disallow: /

As a result, I'm not getting spidered by anything at MS.

GaryK

12:55 am on Oct 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I saw a thread listing the new names for all the MSN crawlers. The only ones you didn't list are:

lanshanbot
msnbot-news
msnbot-NewsBlogs
msnbot
msnbot/1.0-MM

EDIT: I found the thread.
[webmasterworld.com...]

[edited by: GaryK at 12:57 am (utc) on Oct. 16, 2006]

wilderness

3:43 pm on Oct 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



It appears to me that MS is without capability to differentiate between the two bots.
I'm going to remove the robots.txt assertion and add a Class C denial of the media bot.

65.55.212.184 - - [16/Oct/2006:04:20:53 -0700] "GET /robots.txt HTTP/1.0" 200 4527 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"

65.54.188.148 - - [15/Oct/2006:23:21:56 -0700] "GET /robots.txt HTTP/1.0" 200 4540 "-" "msnbot/1.0 (+http://search.msn.com/msnbot.htm)"

-----------
Left the IP ranges alone.
Added the following three lines:

SetEnvIf User-Agent "msnbot-MM" keep_out
SetEnvIf User-Agent "msnbot-products" keep_out
SetEnvIf User-Agent "msnbot-media" keep_out

I'll update on the results.

wilderness

7:41 pm on Oct 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Yepper.
It started crawling immediately.

It makes no distinction between the different bots, as advised in:
[webmasterworld.com...]

#:3028975

jdMorgan

8:04 pm on Oct 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Don,

You might want to use SetEnvIfNoCase -- MS doesn't enforce a corporate User-agent-name standard, apparently, so sometimes the second part of the UA name is capitalized in the request header, and sometimes it isn't.
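The effect of the capitalization issue can be illustrated outside Apache with a small Python sketch. It only mimics the idea of case-sensitive versus case-insensitive pattern matching and is not how Apache is implemented; the two sample UA strings are hypothetical variants built from the names mentioned in this thread.

```python
import re

# Pattern as it might be written in a rule (lower-case "products").
pattern = "msnbot-products"

# Hypothetical request headers: same bot name, different capitalization.
user_agents = [
    "msnbot-products/1.0 (+http://search.msn.com/msnbot.htm)",
    "msnbot-Products/1.0 (+http://search.msn.com/msnbot.htm)",
]

def matches(ua: str, ignore_case: bool) -> bool:
    """Unanchored regex search, roughly in the spirit of SetEnvIf vs SetEnvIfNoCase."""
    flags = re.IGNORECASE if ignore_case else 0
    return re.search(pattern, ua, flags) is not None

case_sensitive = [matches(ua, ignore_case=False) for ua in user_agents]
case_insensitive = [matches(ua, ignore_case=True) for ua in user_agents]
print(case_sensitive)    # [True, False]
print(case_insensitive)  # [True, True]
```

With the case-sensitive match, the capitalized variant slips past the rule; the case-insensitive match catches both.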

I have noticed that the msnbots have apparently been "improved" sometime in the past 48 hours, and that msnbot/1.0 (search) will now accept records specifying "msnbot/", while "msnbot-media", "msnbot-news", or "msnbot-Products" will no longer match that string (because of the trailing slash). In other words, the following now works correctly to allow msnbot, but deny all of the MSN specialty bots:

# Allow msnbot/0.9 and msnbot/1.0 (search bots) except for cgi-bin
User-agent: msnbot/
Disallow: /cgi-bin
#
# Disallow all others
User-agent: *
Disallow: /
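Why the trailing slash matters can be sketched in a few lines of Python. This assumes the bots pick a record by a case-insensitive prefix match of the record token against their own UA string; that rule is a guess made for illustration, not documented MSN behavior, and record_applies is a hypothetical helper. The two UA strings are the ones from the log lines earlier in this thread.

```python
def record_applies(record_token: str, user_agent: str) -> bool:
    """Assumed rule: a record applies when the UA starts with its token, ignoring case."""
    return user_agent.lower().startswith(record_token.lower())

user_agents = [
    "msnbot/1.0 (+http://search.msn.com/msnbot.htm)",
    "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)",
]

for token in ("msnbot", "msnbot/"):
    # Show only the name/version part of each UA that the token would match.
    matched = [ua.split(" ")[0] for ua in user_agents if record_applies(token, ua)]
    print(token, "->", matched)
# msnbot -> ['msnbot/1.0', 'msnbot-media/1.0']
# msnbot/ -> ['msnbot/1.0']
```

Under that assumption, a bare "msnbot" token is claimed by every MSN bot, while "msnbot/" is claimed only by the search bot, leaving the specialty bots to fall through to the catch-all record.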


Jim

wilderness

8:32 pm on Oct 16, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Thanks, Jim.