Hit on robots.txt = non-human visitor?

Apart from the webmasters of the world, naturally

SteveJohnston

6:06 pm on Feb 24, 2004 (gmt 0)

Are there circumstances where a normal web surfer would generate a hit on the robots.txt file?

Given the prevalence of automated visitors these days, which do little but clutter our web logs with spurious entries, am I safe to treat a visit that begins with a robots.txt hit as that of a 'bot' of some description, regardless of what it then appears to do or what the user-agent is?

Clearly the above excludes the curious web site rambling of the professional human webmaster, but then they don't count as customers anyway ;-)

Any thoughts people? Should I add it to my filter?

Steve

keyplyr

6:48 pm on Feb 24, 2004 (gmt 0)

Yes, I assume that visits beginning with robots.txt are bots, despite the UA. They also typically send no referrer, and usually request only one type of file (either pages or images), but nowadays there are a few (for example, YahooSeeker) that will get everything.
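A rough sketch of that filter in Python, in case it helps. It assumes the log has already been parsed into (ip, user_agent, path) tuples in time order, and it treats each (ip, user_agent) pair as one visit; both of those are illustrative assumptions, not something any log analyser gives you out of the box. The name flag_robots_first and the sample hits are made up.

```python
def flag_robots_first(hits):
    """Return the (ip, user_agent) pairs whose first request was /robots.txt.

    hits: iterable of (ip, user_agent, path) tuples in chronological order
    (a hypothetical, pre-parsed form of the access log).
    """
    first_path = {}
    for ip, user_agent, path in hits:
        # setdefault keeps only the first path seen for each visitor key
        first_path.setdefault((ip, user_agent), path)
    return {key for key, path in first_path.items() if path == "/robots.txt"}


# Made-up sample data: one visitor starts on robots.txt, the other does not.
hits = [
    ("10.0.0.1", "ExampleBot/1.0", "/robots.txt"),
    ("10.0.0.1", "ExampleBot/1.0", "/index.html"),
    ("10.0.0.2", "Mozilla/4.0", "/index.html"),
    ("10.0.0.2", "Mozilla/4.0", "/robots.txt"),
]
flagged = flag_robots_first(hits)
```

Only the first visitor is flagged; the second one requested robots.txt later in the visit, so this rule alone would not catch it.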

pageoneresults

6:52 pm on Feb 24, 2004 (gmt 0)

If the User-agent that you are seeing in the logs is one that you don't want indexing your site, then you can Disallow that User-agent from spidering...

User-agent: Named-bot
Disallow: /

It's the ones that don't request the robots.txt and spider your site that are the pests.
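If you want to check how a rule like the one above is read by a well-behaved bot, Python's standard urllib.robotparser can evaluate it locally. This is only a sketch: "Named-bot" is the placeholder from the rule above, and "SomeOtherBot" and the page path are invented for the comparison.

```python
from urllib.robotparser import RobotFileParser

# The rule from the post, fed to the parser as a list of lines.
rules = """\
User-agent: Named-bot
Disallow: /
""".splitlines()

parser = RobotFileParser()
parser.parse(rules)

# Named-bot is shut out of everything; an agent with no matching
# record (and no "User-agent: *" record) is allowed by default.
named_allowed = parser.can_fetch("Named-bot", "/any/page.html")
other_allowed = parser.can_fetch("SomeOtherBot", "/any/page.html")
```

This only tells you what a compliant bot should do; it does nothing about the ones that never read the file.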

Are there circumstances where a normal web surfer would generate a hit on the robots.txt file?

Normal web surfer? Probably not. Experienced webmaster? Yes. Someone looking to hack you? Possibly.

Be careful what you exclude in your robots.txt file. If security is involved, the content should be in a password-protected folder with no mention of it in the robots.txt file.

Hmmm, did I answer your question? ;)

If I didn't, keyplyr did.

WebJoe

8:30 pm on Feb 24, 2004 (gmt 0)

Are there circumstances where a normal web surfer would generate a hit on the robots.txt file?

AFAIK yes, the "make available offline" function of IE (post 5.5) checks robots.txt:

2004-02-24 20:20:28 80.218.91.124 - XXX.YY.NN.MMM 80 GET /robots.txt - 404 Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322; MSIECrawler)
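A small sketch of how a filter might pick this case out, so that an offline-favourites fetch is not counted as an ordinary bot. The field names are my guess at the layout of the line above (date, time, client IP, username, server IP, port, method, URI, query, status, user-agent); check them against your own log's field header before relying on them.

```python
# Assumed field order for the log line quoted above (not confirmed by the post).
FIELDS = [
    "date", "time", "client_ip", "username", "server_ip", "port",
    "method", "uri_stem", "uri_query", "status", "user_agent",
]


def parse_line(line):
    """Split a log line into a dict; the user-agent keeps its internal spaces."""
    parts = line.split(None, len(FIELDS) - 1)
    if len(parts) != len(FIELDS):
        return None  # line does not match the assumed layout
    return dict(zip(FIELDS, parts))


def is_offline_ie_fetch(entry):
    """True for a robots.txt request carrying the MSIECrawler token."""
    return entry["uri_stem"] == "/robots.txt" and "MSIECrawler" in entry["user_agent"]


line = (
    "2004-02-24 20:20:28 80.218.91.124 - XXX.YY.NN.MMM 80 GET /robots.txt - 404 "
    "Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1; .NET CLR 1.1.4322; MSIECrawler)"
)
entry = parse_line(line)
offline_fetch = is_offline_ie_fetch(entry)
```

Whether such a hit counts as a human visitor is a judgment call: a person asked for the pages, but IE is doing the crawling.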

SteveJohnston

9:21 am on Feb 25, 2004 (gmt 0)

Thanks WebJoe, that was exactly the kind of thing I was looking for and hadn't considered.

And pageoneresults, I am keen not to stop them spidering, as it is very interesting to see what is going on on the site. Of course, I am beginning to realise there are some agents that I should definitely exclude using the robots.txt file, as they are patently up to no good.

Thanks all.

Steve