msnbot/2.0b

New crawler from MS in the works?

         

GaryK

5:04 pm on Feb 1, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



msnbot/2.0b ( [search.msn.com...]
131.107.0.95
tide525.microsoft.com

Read robots.txt and then left, so I don't know if it actually obeys it or not.

Is this really a new msnbot in beta?

wilderness

3:37 pm on Feb 13, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Consistency at its best ;)

65.55.106.132 - - [13/Feb/2009:12:12:13 +0000] "GET /robots.txt HTTP/1.1" 200 5023 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.132 - - [13/Feb/2009:12:13:14 +0000] "GET /MyFolder/MyPage.html HTTP/1.0" 200 41524 "-" "msnbot/2.0b"

GaryK

9:25 pm on Feb 13, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Do they get credit for being consistent in their inconsistency?

wilderness

9:34 pm on Feb 13, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Only credit I'm able to extend is "eat 403's"

santapaws

9:00 am on Feb 16, 2009 (gmt 0)

10+ Year Member



For the last two days this bot, from the range 65.52.0.0 - 65.55.255.255, has been ignoring robots.txt and hitting the spider trap. Is this for real? When it gets banned on one IP, it hits with another. I guess I need to block the whole range? A major company like this throwing out the rule book for ethical spidering?
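For anyone weighing the "block the whole range" option, here is a minimal sketch of how the quoted start and end addresses collapse into CIDR notation, which is the form most firewalls and server configs expect. Only the two range endpoints come from the post; the helper name `in_blocked_range` is made up for illustration.

```python
import ipaddress

# Range endpoints as quoted in the post; everything else is illustrative.
first = ipaddress.IPv4Address("65.52.0.0")
last = ipaddress.IPv4Address("65.55.255.255")

# Collapse the start/end pair into the smallest list of CIDR blocks.
blocks = list(ipaddress.summarize_address_range(first, last))


def in_blocked_range(ip: str) -> bool:
    """Return True if ip falls inside any of the summarized blocks."""
    addr = ipaddress.ip_address(ip)
    return any(addr in block for block in blocks)


print([str(b) for b in blocks])            # ['65.52.0.0/14']
print(in_blocked_range("65.55.106.132"))   # True
print(in_blocked_range("131.107.0.95"))    # False
```

The whole range is a single /14, so one deny rule would cover it. Whether blocking that much address space is a good idea is a separate judgment call.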

GaryK

5:49 pm on Feb 16, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"ethical spidering"

That's a term I'm unfamiliar with. ;)

It was back again the last few days. This time it read not only robots.txt but the default root page as well. So for me it's been entirely respectful of robots.txt.

santapaws

9:08 am on Feb 19, 2009 (gmt 0)

10+ Year Member



Still coming in off different IPs and hitting the bot trap. Is this thing for real?

GaryK

4:21 pm on Feb 19, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



We've already established that MS says the user agent is legit. From there though you have to take your own steps to ensure it's not being spoofed.
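One common way to take those "own steps" is a forward-confirmed reverse DNS check: look up the hostname for the visiting IP, confirm the hostname sits under a domain you trust, then resolve that hostname forward and confirm it maps back to the same IP. The sketch below is only that, a sketch. The suffix `.search.msn.com` is an assumption, not something confirmed in this thread, and `verify_crawler` is a hypothetical helper name. The resolver functions are parameters so the logic can be exercised without touching the network.

```python
import socket

# Assumption: hostname suffixes accepted as the genuine crawler. Check the
# operator's current documentation before relying on this list.
TRUSTED_SUFFIXES = (".search.msn.com",)


def verify_crawler(ip, suffixes=TRUSTED_SUFFIXES,
                   reverse=socket.gethostbyaddr,
                   forward=socket.gethostbyname_ex):
    """Forward-confirmed reverse DNS check for an IP claiming to be a crawler."""
    try:
        # Reverse lookup: IP -> primary hostname.
        host = reverse(ip)[0].rstrip(".").lower()
    except OSError:
        return False
    # A spoofer controls its own reverse DNS, so the name alone proves nothing
    # unless it sits under a domain the spoofer cannot control.
    if not host.endswith(tuple(suffixes)):
        return False
    try:
        # Forward lookup: hostname -> list of IPv4 addresses.
        addresses = forward(host)[2]
    except OSError:
        return False
    return ip in addresses


# Example call (performs live DNS lookups, so it is left commented out):
# verify_crawler("65.55.106.132")
```

A user agent string can be typed by anyone, so a check along these lines is usually done before deciding whether to whitelist or ban on the user agent alone.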

santapaws

5:10 pm on Feb 19, 2009 (gmt 0)

10+ Year Member



"is this thing for real" is a figure of speech.

SoyDevon

3:20 am on May 25, 2009 (gmt 0)

10+ Year Member



Hi, this is a very interesting discussion to me as I've gotten quite annoyed with msnbot 2.0 and the MSIE visits from MS IPs that put fake referers in my logs. This should've been resolved months ago. It was bad enough the bot never listened to my 301's, but this is ridiculous.

I recently noticed that msnbot 2.0b has actually been hitting some files that my robots.txt tells it to ignore! That kept happening so I just 403'd the bot and later the entire MSIE IP range these keep coming from (due to the fake referer spam from an IE browser at the same IP range). Now look what I find in my logs today (and this is just a sample!):

65.55.106.203 - - [24/May/2009:13:16:09 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:13:17:20 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:13:19:28 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:13:22:31 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:13:26:34 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.51.112 - - [24/May/2009:13:31:27 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:13:31:59 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:13:38:15 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.231 - - [24/May/2009:13:45:27 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.231 - - [24/May/2009:13:53:48 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:14:03:11 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.51.115 - - [24/May/2009:14:03:22 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.104.16 - - [24/May/2009:14:06:51 -0600] "GET /robots.txt HTTP/1.1" 403 787 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.104.16 - - [24/May/2009:14:06:52 -0600] "GET /poetry/ HTTP/1.1" 403 784 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.106.231 - - [24/May/2009:14:13:21 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:14:25:05 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.231 - - [24/May/2009:14:26:26 -0600] "GET /about/ HTTP/1.1" 403 821 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"

I block bad Java bots, and they don't even do this to me.
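For readers who want to quantify this kind of hammering in their own logs, here is a rough sketch that parses combined-format lines and counts hits per bot version, path and status. The five sample lines are copied from the excerpt above. The regular expression assumes the common Apache "combined" layout, and `tally` is a hypothetical helper name.

```python
import re
from collections import Counter

# Assumes Apache "combined" log format, as in the excerpt above.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<proto>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

SAMPLE = '''
65.55.106.203 - - [24/May/2009:14:03:11 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.51.115 - - [24/May/2009:14:03:22 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.104.16 - - [24/May/2009:14:06:51 -0600] "GET /robots.txt HTTP/1.1" 403 787 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.104.16 - - [24/May/2009:14:06:52 -0600] "GET /poetry/ HTTP/1.1" 403 784 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.106.231 - - [24/May/2009:14:26:26 -0600] "GET /about/ HTTP/1.1" 403 821 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
'''


def tally(lines):
    """Count hits keyed by (bot token, path, status); skip unparseable lines."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m is None:
            continue
        # Keep only the leading product token, e.g. "msnbot/2.0b".
        bot = m["agent"].split(" ", 1)[0]
        counts[(bot, m["path"], m["status"])] += 1
    return counts


counts = tally(SAMPLE.strip().splitlines())
for (bot, path, status), n in counts.most_common():
    print(f"{n:3d}  {status}  {bot:12s} {path}")
```

Run against a full day's log, a table like this makes it easy to see how many times the bot re-requested robots.txt after being refused.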

caribguy

4:27 am on May 25, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



SoyDevon, be glad Spinn3r hasn't found you yet. 403's are like crystal meth to them.

Mokita

6:20 am on Jun 2, 2009 (gmt 0)

10+ Year Member



I've just been researching this beta bot due to its bad or strange behaviour and stumbled across a message from mid-April in the Microsoft Webmaster Forums with a title of "msnbot is using MY robots.txt to crawl YOUR site".

(I'm not sure that quoting from other forums is allowed at WebmasterWorld, so refrained from doing so.)

A mod replied on April 15 saying the issue should be fixed shortly. But I was still seeing requests for disallowed directories right up till 17 May.

The original message makes scary reading, and you have to wonder how M$ unleashed such a badly flawed product on unsuspecting webmasters in Feb and then took over a month to fix it after they were notified in April.

The weird behaviour I am still seeing is that the bot will request a page successfully, then about one minute later request it again with no change in any of the details shown in logs (user agent, IP etc), but receive a redirect as a result of not providing an "Accept-Encoding" header.

So if they know that not sending that header is causing problems, why don't they fix it instead of requesting every page twice?
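The double-fetch pattern described above can be picked out of parsed log data with a small scan. The sketch below looks for a 200 followed, within a short window, by a redirect for the same IP, user agent and path. All of the data here is hypothetical (documentation-range IP, invented paths, a 302 standing in for whichever redirect your server sends), as are the name `find_repeat_fetches` and the 90-second window.

```python
from datetime import datetime, timedelta


def find_repeat_fetches(hits, window=timedelta(seconds=90)):
    """Return (first, second) pairs where a 200 is followed within `window`
    by a 3xx for the same ip, agent and path."""
    last_ok = {}
    pairs = []
    for hit in sorted(hits, key=lambda h: h["time"]):
        key = (hit["ip"], hit["agent"], hit["path"])
        prev = last_ok.get(key)
        if (prev is not None
                and 300 <= hit["status"] < 400
                and hit["time"] - prev["time"] <= window):
            pairs.append((prev, hit))
        if hit["status"] == 200:
            last_ok[key] = hit
    return pairs


# Hypothetical parsed log entries, for illustration only.
UA = "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
hits = [
    {"time": datetime(2009, 6, 1, 10, 0, 0), "ip": "203.0.113.7",
     "agent": UA, "path": "/page-a.html", "status": 200},
    {"time": datetime(2009, 6, 1, 10, 1, 2), "ip": "203.0.113.7",
     "agent": UA, "path": "/page-a.html", "status": 302},
    {"time": datetime(2009, 6, 1, 10, 5, 0), "ip": "203.0.113.7",
     "agent": UA, "path": "/page-b.html", "status": 200},
]

for first, second in find_repeat_fetches(hits):
    gap = (second["time"] - first["time"]).total_seconds()
    print(f'{first["path"]}: 200 then {second["status"]} after {gap:.0f}s')
```

Standard access logs do not record request headers such as Accept-Encoding, so a scan like this only shows the timing pattern; confirming the missing header would need extra logging.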

[edited by: Mokita at 6:27 am (utc) on June 2, 2009]
