I have the following bot turning up regularly in my logs and am trying to work out whether it is a good or bad one. It doesn't fall into my bot traps, but some of its requests seem unusual.
A WHOIS lookup on its IP address resolves to Microsoft Corp.
Here is an access log entry and the matching error log entry:
65.55.212.*** - - [27/Oct/2007:08:33:53 +0100] "GET /%7Edomainname/rss.xml HTTP/1.0" 404 1232 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"
[Sat Oct 27 08:33:53 2007] [error] [client 65.55.212.***] File does not exist: /var/www/vhosts/domainname.org.uk/web_users/domainname
Would a legitimate spider be looking for a non-existent directory called web_users? That directory isn't in any search index, and it is part of the Apache setup rather than something reachable via ordinary links.
I've checked MSN search for entries relating to my domain, and an enormous amount of content that has been denied in robots.txt for several weeks is still in their index. I can't see any tools to get it deleted (whereas Yahoo and Google do have such tools).
The WHOIS record says:
OrgName: Microsoft Corp
OrgID: MSFT
Address: One Microsoft Way
City: Redmond
StateProv: WA
PostalCode: 98052
Country: US
NetRange: 65.52.0.0 - 65.55.255.255
CIDR: 65.52.0.0/14
NetName: MICROSOFT-1BLK
NetHandle: NET-65-52-0-0-1
Parent: NET-65-0-0-0-0
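As a rough cross-check of the WHOIS record above, the following sketch confirms that the logged address prefix falls inside the listed CIDR block. It uses Python's standard ipaddress module. Because the last octet of the logged IP is masked, the whole 65.55.212.0/24 prefix is tested instead of a single address; the function name in_microsoft_block is made up for this illustration.

```python
import ipaddress

# CIDR block copied from the WHOIS output above.
MICROSOFT_BLOCK = ipaddress.ip_network("65.52.0.0/14")


def in_microsoft_block(prefix):
    """Return True if the given network prefix lies entirely inside the WHOIS block.

    Hypothetical helper, written only for this example.
    """
    return ipaddress.ip_network(prefix).subnet_of(MICROSOFT_BLOCK)


# The log shows 65.55.212.*** so the full /24 is checked rather than one host.
print(in_microsoft_block("65.55.212.0/24"))
# First and last addresses of the block, for comparison with NetRange.
print(MICROSOFT_BLOCK[0], "-", MICROSOFT_BLOCK[-1])
```

This only shows that the address space is registered to Microsoft; it says nothing about whether the crawler is behaving sensibly.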
Content denied in robots.txt will slowly fall out of the index. In this case, it might take up to a year, during which time you can use it as a reminder to give robots.txt maintenance a bit more priority on the next project... ;)
Jim
With regard to the links - I've done a Windows search for the string "web_users" on HDD copies of the site, and I can definitely say I have no links containing that string. The site was put together before I ever knew there was a web_users folder in the Apache domain setup, and I certainly never go there. So what sort of poorly formatted link do you have in mind? Are you saying that a badly formatted link might send someone there?
So it is not that msnbot is requesting that path; rather, the server prepends that path to the requested URL in order to convert the requested URL to a server filepath.
Try requesting a non-existent file yourself, and you can confirm this -- as well as determine precisely what your server uses as the "DocumentRoot" filepath for your site.
Jim
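The URL-to-filepath mapping described above can be sketched roughly as follows. This is only an illustration under stated assumptions, not the server's actual logic: it assumes that %7E is decoded to "~" and that a leading "/~name/" is mapped to a per-user directory (as Apache's UserDir-style setups do), with the web_users base path taken from the error log entry. The DOCUMENT_ROOT value and the function name url_to_filepath are hypothetical.

```python
import posixpath
from urllib.parse import unquote

# Base path taken from the error log entry in this thread.
USER_DIR_BASE = "/var/www/vhosts/domainname.org.uk/web_users"
# Hypothetical value; the real DocumentRoot is not shown in the thread.
DOCUMENT_ROOT = "/var/www/vhosts/domainname.org.uk/httpdocs"


def url_to_filepath(url_path):
    """Map a requested URL path to the filesystem path a server might try.

    Assumption: "/~name/..." requests go to a per-user directory,
    and everything else is resolved under the document root.
    """
    path = unquote(url_path)  # "%7E" becomes "~"
    if path.startswith("/~"):
        user, _, rest = path[2:].partition("/")
        return posixpath.join(USER_DIR_BASE, user, rest)
    return posixpath.join(DOCUMENT_ROOT, path.lstrip("/"))


# The request from the access log entry.
print(url_to_filepath("/%7Edomainname/rss.xml"))
```

Under these assumptions, the request for /%7Edomainname/rss.xml resolves to a path beneath web_users/domainname, which is consistent with the directory named in the error log even though no link on the site contains the string "web_users".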