Forum Moderators: goodroi


Frequency of GET /robots.txt

How often should a crawler/spider/robot ask for robots.txt?


Dijkgraaf

10:55 pm on Jun 2, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've been working on creating a rating system for crawler/spiders/robots based on their behaviour.

One of the criteria I want to give a quantitative measure is how often a bot requests robots.txt.

I realise this is hardly a new concept; there was even a proposal in 1996 to let robots.txt specify how often it should be fetched (see [robotstxt.org...]). However, I've seen various behaviours from bots and I'd be interested in people's opinions (I won't give my own opinion as yet, as I want to avoid bias as much as possible at this point :-).

So how often do you think it should ask for robots.txt, what would be too often, and what would be not often enough?

jdMorgan

11:05 pm on Jun 2, 2005 (gmt 0)




I'd prefer:

At least once per day, but otherwise no sooner than the HTTP "Expires" response header indicates, per spider IP address (for those that have multiple instances on different machines).
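That policy could be sketched roughly as follows; the function name, the one-day cap constant, and the clamping of past Expires times are my own illustration, not taken from any real crawler:

```python
from datetime import datetime, timedelta

MAX_AGE = timedelta(days=1)  # hard cap: re-fetch robots.txt at least once per day

def robots_refetch_due(last_fetch, expires=None, now=None):
    """Return True when a cached robots.txt should be fetched again.

    Honours the server's Expires time when one was sent, but never
    waits longer than MAX_AGE after the last fetch.
    """
    now = now or datetime.utcnow()
    deadline = last_fetch + MAX_AGE
    if expires is not None:
        # an Expires in the past should not push the deadline before last_fetch
        deadline = min(deadline, max(expires, last_fetch))
    return now >= deadline
```

Each crawler instance (IP) would keep its own `last_fetch`, matching Jim's per-IP wording.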

Jim

Staffa

1:32 am on Jun 3, 2005 (gmt 0)




I would agree: once per day is enough, rather than at each visit from each IP address of the same bot, which can add up to a lot of requests per day.

On the other hand, I am most annoyed when a new bot, on its first visit, crawls all the pages, even when it does obey robots.txt. This gives me no opportunity to evaluate whether the bot is useful to my site or basically just a useless downloader.

Dijkgraaf

1:59 am on Jun 3, 2005 (gmt 0)




Hi Staffa. You could ban all bots by default except the ones you think are useful. You could vary this a bit by having a few pages that all bots are allowed to reach, e.g. keep some pages in the root directory, but ban all bots (except those that you like) from accessing the subdirectories.
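In robots.txt terms, that scheme could look something like this (the trusted bot and the directory names are hypothetical; root-level pages stay crawlable because they are never disallowed):

```
# Bots I trust may crawl everything
User-agent: Googlebot
Disallow:

# Everyone else only gets the pages in the root directory
User-agent: *
Disallow: /articles/
Disallow: /archive/
```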

Dijkgraaf

2:02 am on Jun 3, 2005 (gmt 0)




Thanks jdMorgan, I'll have a look at the HTTP Expires idea.

jdMorgan

2:47 am on Jun 3, 2005 (gmt 0)




Notice that I explicitly *allowed* for multiple instances of the same robot using different IP addresses in my comments above.

While I prefer that one 'bot grabs robots.txt and distributes it to the actual crawlers, this may not work well with huge distributed crawling networks like G and Y! use. I do hope that they will limit the number of crawler instances (IPs) fetching robots.txt to a reasonable number, but this would be a separate 'scoring item' if I were to 'grade' robots.

Also, my "one day" specification was a maximum update time, not a minimum. The maximum robots.txt update time is important, and must not be too long. For example, if you wish to upload a page but do not wish it to be indexed, you must update robots.txt to disallow the new page and then wait for this maximum update time before uploading the new page and linking to it. Otherwise, a robot with a stale copy of your robots.txt will feel free to index the new page. So it's important that the maximum update time not be too long -- otherwise, this latency can really interfere with site maintenance.
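To put that timing rule in concrete terms, a sketch under the one-day worst case discussed above (the function and constant names are illustrative):

```python
from datetime import datetime, timedelta

# Worst-case robots.txt refresh interval across well-behaved bots
MAX_UPDATE = timedelta(days=1)

def earliest_safe_upload(robots_txt_updated_at):
    """Earliest moment a newly disallowed page can be uploaded and linked
    without risking a crawl by a bot holding a stale robots.txt."""
    return robots_txt_updated_at + MAX_UPDATE
```

So if you add the Disallow rule at noon, the page itself should not go live until noon the next day.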

For several major 'bots, the minimum time can be controlled by setting your robots.txt Expires header to a short value. I found this out the hard way when I set it to "Expires A1" (expire one second after access) by accident, and several robots responded by re-fetching robots.txt *every time* they requested a different page! :o

Jim

Dijkgraaf

3:05 am on Jun 3, 2005 (gmt 0)




How are you setting the Expires header on the robots.txt file?

Fetching the robots.txt file before every file seems to be the standard behaviour of some bots; either that, or they have their minimum robots.txt fetch interval set so close to their maximum page fetch rate that it amounts to the same thing.

Staffa

9:15 am on Jun 3, 2005 (gmt 0)




Thanks Dijkgraaf, that's a simple and neat suggestion.

It's too late for my 5-year-old site to reshuffle the pages in the root (it's doing too well with the main players), but I'll keep it in mind when I build the next site.

jdMorgan

5:08 am on Jun 18, 2005 (gmt 0)




> How are you setting the Expires header on the robots.txt file?

I'm using Apache mod_headers and mod_expires:


# Set up Cache-Control headers
ExpiresActive On
# Default: expire everything 1 week (604800 s) after last access ("A" = access time)
ExpiresDefault A604800
Header append Cache-Control "must-revalidate"
# Apply a shorter Expires time (2 hours) to frequently-updated files
<FilesMatch "^robots">
ExpiresDefault A7200
</FilesMatch>
... (more)

Jim