is this bot friendly or not?

deciphering a log entry about a bot

         

revrob

12:09 pm on Oct 27, 2007 (gmt 0)

10+ Year Member



Relative newbie to website security and on a steep learning curve. Thanks in advance for any advice - I've learnt a lot here so far.

I have the following bot turning up regularly in my logs and am trying to work out whether it is a good or bad one. It doesn't fall into my bot traps, but some of its requests seem unusual.
It resolves to Microsoft Corp in a WHOIS lookup.

Here are matching access-log and error-log entries:

65.55.212.*** - - [27/Oct/2007:08:33:53 +0100] "GET /%7Edomainname/rss.xml HTTP/1.0" 404 1232 "-" "msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"

[Sat Oct 27 08:33:53 2007] [error] [client 65.55.212.***] File does not exist: /var/www/vhosts/domainname.org.uk/web_users/domainname
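
For anyone deciphering an access-log entry like the one above, here is a minimal Python sketch that splits it into fields and decodes the requested path. It assumes the log is in Apache's Combined Log Format; the names `LOG_PATTERN` and `parse_access_line` are made up for illustration and are not part of any tool mentioned in this thread.

```python
import re
from urllib.parse import unquote

# Assumed layout (Combined Log Format):
# host ident user [time] "method path protocol" status size "referer" "agent"
LOG_PATTERN = re.compile(
    r'(?P<host>\S+) (?P<ident>\S+) (?P<user>\S+) '
    r'\[(?P<time>[^\]]+)\] '
    r'"(?P<method>\S+) (?P<path>\S+) (?P<protocol>[^"]+)" '
    r'(?P<status>\d{3}) (?P<size>\S+) '
    r'"(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_access_line(line):
    """Return the log fields as a dict, or None if the line does not match."""
    match = LOG_PATTERN.match(line)
    if match is None:
        return None
    fields = match.groupdict()
    # Percent-decode the path so that %7E shows up as "~".
    fields["decoded_path"] = unquote(fields["path"])
    return fields

# The access-log entry quoted above, with the last octet still masked.
line = ('65.55.212.*** - - [27/Oct/2007:08:33:53 +0100] '
        '"GET /%7Edomainname/rss.xml HTTP/1.0" 404 1232 "-" '
        '"msnbot-media/1.0 (+http://search.msn.com/msnbot.htm)"')

entry = parse_access_line(line)
print(entry["status"], entry["decoded_path"], entry["agent"])
```

Decoded, the request is for /~domainname/rss.xml, so the bot asked for a "~user" style URL and never mentioned web_users itself.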

Would a legitimate spider be looking for a non-existent directory under web_users? That folder isn't in any search index, and it is part of the Apache setup rather than something reachable via ordinary links.

I've checked MSN search for entries relating to my domain. An enormous amount of content that has been denied in robots.txt for several weeks is still in their index, and I can't see any tools to get it deleted (whereas Yahoo and Google do have such tools).

The WHOIS says:
OrgName: Microsoft Corp
OrgID: MSFT
Address: One Microsoft Way
City: Redmond
StateProv: WA
PostalCode: 98052
Country: US

NetRange: 65.52.0.0 - 65.55.255.255
CIDR: 65.52.0.0/14
NetName: MICROSOFT-1BLK
NetHandle: NET-65-52-0-0-1
Parent: NET-65-0-0-0-0
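
As a quick sanity check on the WHOIS output, the following sketch uses Python's standard `ipaddress` module to confirm that the logged addresses sit inside that CIDR block. Because the last octet is masked in the log, it tests the whole 65.55.212.0/24 prefix rather than guessing a specific address; the variable names are illustrative only.

```python
import ipaddress

# CIDR block reported by WHOIS for Microsoft Corp.
whois_block = ipaddress.ip_network("65.52.0.0/14")

# The log masks the last octet, so check the whole /24 it must belong to.
logged_prefix = ipaddress.ip_network("65.55.212.0/24")

# True when every address in the /24 lies inside the WHOIS block.
inside = logged_prefix.subnet_of(whois_block)
print(inside)

# First and last addresses of the block, for comparison with NetRange.
print(whois_block[0], whois_block[-1])
```

This only shows that the address space is registered to Microsoft; it says nothing about how the bot behaves.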

jdMorgan

3:34 pm on Oct 28, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



To me, it looks like Microsoft's media bot following a badly formatted link. You might want to search for that specific URL and see if you can find the source of the bad link.

Content denied in robots.txt will slowly fall out of the index. In this case, it might take up to a year, during which time you can use it as a reminder to give robots.txt maintenance a bit more priority on the next project... ;)

Jim

revrob

3:58 pm on Oct 28, 2007 (gmt 0)

10+ Year Member



Thank you. Point taken about robots.txt. As I said, learning curve! For years I had the site on free webspace with no PHP options, so I just relied on meta tags. When we moved to our present host, I installed the PHP stuff before thinking about robots.txt, so it ended up in the indexes. I now know better (although the index entries do provide material for the bot traps, as bots follow out-of-date links to little tar pits on the site!)

With regard to the links: I've done a Windows search for the string "web_users" on HDD copies of the site, and I can definitely say I have no links containing that string. The site was put together before I ever knew there was a web_users folder in the Apache domain setup, and I certainly never go there. So what sort of poorly formatted link do you have in mind? Are you saying that a badly formatted link might send someone there?

jdMorgan

4:22 pm on Oct 28, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



At least this part of the filepath is determined by your host, and is not part of the URL: /var/www/vhosts/domainname.org.uk/web_users/

So it is not that MSbot is requesting that part, but that the server prepends that path to the requested URL in order to convert the requested URL to a server filepath.

Try requesting a non-existent file yourself, and you can confirm this -- as well as determine precisely what your server uses as the "DocumentRoot" filepath for your site.
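
To illustrate the URL-to-filepath conversion, here is a rough Python sketch. It assumes the host maps "/~user/..." URLs into the web_users folder, which is only a guess based on the two log entries quoted earlier; the function name `userdir_filepath` and the mapping rule are hypothetical, and your host's actual configuration may differ.

```python
import posixpath
from urllib.parse import unquote

# Taken from the error-log entry; the host decides this layout, not the URL.
WEB_USERS_ROOT = "/var/www/vhosts/domainname.org.uk/web_users"

def userdir_filepath(url_path, users_root=WEB_USERS_ROOT):
    """Map a '/~user/...' URL path to a server filepath (assumed rule)."""
    decoded = unquote(url_path)          # "%7E" becomes "~"
    if not decoded.startswith("/~"):
        return None                      # not a "~user" style request
    user, _, rest = decoded[2:].partition("/")
    if rest:
        return posixpath.join(users_root, user, rest)
    return posixpath.join(users_root, user)

print(userdir_filepath("/%7Edomainname/rss.xml"))
```

Under that assumption the request lands in web_users/domainname, which matches the path in the error log even though the bot never asked for web_users by name.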

Jim