Serving up robots.txt

Do crawlers/bots adhere to the HTTP Location header?

jpcooper

8:20 am on May 14, 2003 (gmt 0)

We're attempting to use /robots.txt requests to filter search engines out of our site access report. I've moved the robots.txt file and have a script that is called in place of 404 errors. The script determines whether the /robots.txt object was requested and, if so, updates a database. It then returns a "Location:" header pointing to the actual object with a 302 (Found) HTTP status.

Will bots then follow this pointer to retrieve and parse the object?

There seems to be very little documentation on the net regarding how HTTP-compliant bots are. Obviously using a browser works fine, and a few little net resources for checking the headers report that the response is valid - but is the same true for "REAL" bots?

chris_f

12:34 pm on May 14, 2003 (gmt 0)

Welcome to WebmasterWorld jpcooper,

I think the reason you won't find documentation on SE bots is that they constantly change, and the search engines wouldn't want their competitors getting any good ideas off them. I would set up a small site with no real value to test your theory before putting it into practice.

Chris

jdMorgan

12:50 pm on May 14, 2003 (gmt 0)

jpcooper,

Welcome to WebmasterWorld!

This sounds like a bad idea to me. The Standard for Robots Exclusion states that the name of the robots-control file will be robots.txt, and makes no explicit allowances for 404 or 302 server responses. Robots need to be simple in order to be fast, and it wouldn't surprise me if many can't handle any redirection.

I would suggest the following:
1. Use a server-internal (silent) redirect from robots.txt to your script.
2. Design your script to log the visit, and then open and serve the contents of your actual robots.txt file from within the script itself.
In this way, the robot is never aware that the script was present, and no redirects are necessary.
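
As a rough illustration of the second step, here is a minimal sketch of such a script written as a Python WSGI application. It is only one possible shape: the names make_robots_app and log_visit, the file paths, and the choice of a tab-separated log file instead of the database mentioned earlier are all assumptions made for the example. How the server silently maps /robots.txt to the script is server-specific and is not shown.

```python
from datetime import datetime, timezone


def log_visit(environ, log_path):
    """Append one line per robots.txt request: UTC time, client IP, user agent."""
    line = "\t".join([
        datetime.now(timezone.utc).isoformat(),
        environ.get("REMOTE_ADDR", "-"),
        environ.get("HTTP_USER_AGENT", "-"),
    ])
    with open(log_path, "a", encoding="utf-8") as fh:
        fh.write(line + "\n")


def make_robots_app(robots_path, log_path):
    """Build a WSGI app that logs the request, then serves the real file.

    robots_path and log_path are hypothetical locations chosen by the caller.
    The client only ever sees a plain 200 response; no redirect is involved.
    """
    def app(environ, start_response):
        log_visit(environ, log_path)
        with open(robots_path, "rb") as fh:
            body = fh.read()
        start_response("200 OK", [
            ("Content-Type", "text/plain; charset=utf-8"),
            ("Content-Length", str(len(body))),
        ])
        return [body]
    return app
```

For a quick local trial, the returned app can be handed to wsgiref.simple_server.make_server from the standard library; in production it would sit behind whatever internal rewrite the web server provides.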

HTH,
Jim

StanBo

12:51 pm on May 14, 2003 (gmt 0)

It's hardly an efficient way to do that - lately Googlebot, at least, has become known for not requesting the robots.txt file at all on more than one occasion, simply as an anti-SE-spam measure.
/robots.txt is either requested on one visit out of two (just to compare the results and mark any cheaters found), or it is requested only when the catalog is being indexed, with all the subsequent page requests skipping /robots.txt altogether.
But filtering bots out is not a big problem - most SE bots always use the same network and a very specific user agent to identify themselves. Why don't you just put those into a database?
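
A minimal sketch of that kind of lookup, in Python, might look like the following. Everything in the table is a placeholder: the bot names, user-agent tokens, and network ranges (taken from the reserved documentation ranges) are hypothetical and would need to be replaced with values observed in your own logs. The function name classify is likewise just an illustration.

```python
import ipaddress

# Hypothetical entries for illustration only. The tokens and networks below
# are placeholders (reserved documentation ranges), not real search engine data.
KNOWN_BOTS = [
    {"name": "examplebot", "ua_token": "examplebot", "networks": ["192.0.2.0/24"]},
    {"name": "samplecrawler", "ua_token": "samplecrawler", "networks": ["198.51.100.0/24"]},
]


def classify(remote_addr, user_agent, known_bots=KNOWN_BOTS):
    """Return the bot name if BOTH the user agent and source network match, else None."""
    ua = user_agent.lower()
    try:
        addr = ipaddress.ip_address(remote_addr)
    except ValueError:
        # Unparseable address: treat as an ordinary (non-bot) visitor.
        return None
    for bot in known_bots:
        if bot["ua_token"] not in ua:
            continue
        if any(addr in ipaddress.ip_network(net) for net in bot["networks"]):
            return bot["name"]
    return None
```

Requiring both the network and the user agent to match is a deliberate choice in this sketch: a user-agent string alone is easy to forge, while a network match alone may catch ordinary traffic from the same provider.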

jpcooper

1:15 pm on May 14, 2003 (gmt 0)

Thanks for your responses - I had wondered whether ALL SE bots would adhere to the robots.txt specification. And if, as StanBo suggests, the robots.txt object is NOT requested on every visit, then the access report can be called into question.

It looks like the best filters will come from a combination of the originating host and the user agent.

Thanks again for your suggestions/comments.