It read robots.txt and then left, so I don't know whether it actually obeys it or not.
Is this really a new msnbot in beta?
I recently noticed that msnbot 2.0b has been hitting some files that my robots.txt tells it to ignore! That kept happening, so I 403'd the bot, and later the entire MSIE IP range these requests keep coming from (because of the fake referer spam from an IE browser in the same IP range). Now look what I find in my logs today (and this is just a sample!):
65.55.106.203 - - [24/May/2009:13:16:09 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:13:17:20 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:13:19:28 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:13:22:31 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:13:26:34 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.51.112 - - [24/May/2009:13:31:27 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:13:31:59 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:13:38:15 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.231 - - [24/May/2009:13:45:27 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.231 - - [24/May/2009:13:53:48 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:14:03:11 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.51.115 - - [24/May/2009:14:03:22 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.104.16 - - [24/May/2009:14:06:51 -0600] "GET /robots.txt HTTP/1.1" 403 787 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.104.16 - - [24/May/2009:14:06:52 -0600] "GET /poetry/ HTTP/1.1" 403 784 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"
65.55.106.231 - - [24/May/2009:14:13:21 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.203 - - [24/May/2009:14:25:05 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
65.55.106.231 - - [24/May/2009:14:26:26 -0600] "GET /about/ HTTP/1.1" 403 821 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"
I block bad Java bots, and they don't even do this to me.
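For anyone who would rather tally this than eyeball it, here is a rough Python sketch that counts hits per IP, path and status for a given user agent. It assumes the Apache combined log format seen in the sample above. The names `summarize` and `agent_substring` are made up for illustration, and the three sample lines are copied from the excerpt. Bear in mind that a user-agent string can be spoofed, so this only counts what the client claims to be.

```python
import re
from collections import Counter

# Apache "combined" log format, matching the sample lines in the post.
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def summarize(lines, agent_substring="msnbot"):
    """Count (ip, path, status) hits for agents containing agent_substring."""
    counts = Counter()
    for line in lines:
        m = LOG_RE.match(line)
        if m and agent_substring in m.group("agent").lower():
            counts[(m.group("ip"), m.group("path"), m.group("status"))] += 1
    return counts

# Three lines taken from the log sample above.
sample = [
    '65.55.106.203 - - [24/May/2009:13:16:09 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    '65.55.106.203 - - [24/May/2009:13:17:20 -0600] "GET /robots.txt HTTP/1.1" 403 825 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    '65.55.104.16 - - [24/May/2009:14:06:52 -0600] "GET /poetry/ HTTP/1.1" 403 784 "-" "msnbot/1.1 (+http://search.msn.com/msnbot.htm)"',
]

counts = summarize(sample)
for (ip, path, status), n in counts.most_common():
    print(ip, path, status, n)
```

Run against a full access log (for example `summarize(open("access.log"))`, with the file name being whatever your server uses), this makes it easy to see how often each address re-fetches robots.txt after being refused.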
(I'm not sure that quoting from other forums is allowed at WebmasterWorld, so I refrained from doing so.)
A mod replied on April 15 saying the issue should be fixed shortly. But I was still seeing requests for disallowed directories right up until 17 May.
The original message makes scary reading, and you have to wonder how M$ unleashed such a badly flawed product on unsuspecting webmasters in Feb and then took over a month to fix it after they were notified in April.
The weird behaviour I am still seeing is that the bot will request a page successfully, then about one minute later request it again with no change in any of the details shown in logs (user agent, IP etc), but receive a redirect as a result of not providing an "Accept-Encoding" header.
So if they know that not sending that header is causing problems, why don't they fix it instead of requesting every page twice?
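If you want to check your own logs for the same double-request pattern, something like the Python sketch below might do as a starting point. It assumes the Apache combined log format, which does not record the Accept-Encoding header, so it can only spot the symptom: a 200 followed by a 3xx for the same IP, user agent and path within a short window. The function names, the two-minute window, and the two log lines are all invented for illustration (192.0.2.10 is a documentation address, not a real msnbot IP).

```python
import re
from datetime import datetime, timedelta

# Apache "combined" log format (assumed).
LOG_RE = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) [^"]*" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse(line):
    """Return a dict of fields with a timezone-aware 'time', or None."""
    m = LOG_RE.match(line)
    if m is None:
        return None
    rec = m.groupdict()
    rec["time"] = datetime.strptime(rec["time"], "%d/%b/%Y:%H:%M:%S %z")
    return rec

def find_repeats(lines, window=timedelta(minutes=2)):
    """Pair each 3xx with an earlier 200 for the same IP, agent and path."""
    last_ok = {}   # (ip, agent, path) -> most recent 200 record
    pairs = []
    for line in lines:
        rec = parse(line)
        if rec is None:
            continue
        key = (rec["ip"], rec["agent"], rec["path"])
        prev = last_ok.get(key)
        if (prev is not None and rec["status"].startswith("3")
                and rec["time"] - prev["time"] <= window):
            pairs.append((prev, rec))
        if rec["status"] == "200":
            last_ok[key] = rec
    return pairs

# Hypothetical log lines, made up to illustrate the pattern.
sample = [
    '192.0.2.10 - - [01/Jun/2009:10:00:00 -0600] "GET /example/ HTTP/1.1" 200 5120 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
    '192.0.2.10 - - [01/Jun/2009:10:01:02 -0600] "GET /example/ HTTP/1.1" 301 312 "-" "msnbot/2.0b (+http://search.msn.com/msnbot.htm)"',
]

pairs = find_repeats(sample)
for first, second in pairs:
    gap = (second["time"] - first["time"]).total_seconds()
    print(first["path"], first["status"], "->", second["status"], f"after {gap:.0f}s")
```

To confirm that the missing header is really the trigger, you would need to log it as well, since the pairing above only shows that the two requests got different answers.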
[edited by: Mokita at 6:27 am (utc) on June 2, 2009]