Forum Moderators: goodroi
One of the criteria that I want to give a quantitative measure is how often a bot requests robots.txt.
I realise this is hardly a new concept; there was even a proposal back in 1996 to let robots.txt specify how often it should be fetched (see [robotstxt.org...]). However, I've seen various behaviours from bots, and I'd be interested in people's opinions (I won't give my own opinion as yet, since I want to avoid bias as much as possible at this point :-).
So how often do you think it should ask for robots.txt, what would be too often, and what would be not often enough?
On the other hand, I am most annoyed when a new bot, on its first visit, crawls all the pages even when it does obey robots.txt. That gives me no opportunity to evaluate whether the bot is useful to my site or basically just a useless downloader.
While I prefer that one 'bot grabs robots.txt and distributes it to the actual crawlers, this may not work well with huge distributed crawling networks like G and Y! use. I do hope that they will limit the number of crawler instances (IPs) fetching robots.txt to a reasonable number, but this would be a separate 'scoring item' if I were to 'grade' robots.
Also, my "one day" specification was a maximum update time, not a minimum. The maximum robots.txt update time matters, and must not be too long. For example, if you wish to upload a page but do not wish it to be indexed, you must update robots.txt to disallow the new page and then wait for this maximum update time before uploading the new page and linking to it. Otherwise, a robot with a stale copy of your robots.txt will feel free to index the new page. If the maximum update time is too long, this latency can really interfere with site maintenance.
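The stale-copy race described above can be sketched with Python's standard robots.txt parser (the rules and URLs here are made up for illustration): a robot still holding the old robots.txt considers the new page fair game, while one holding the updated copy does not.

```python
# Illustration of the stale robots.txt problem: a robot caching the old
# rules will happily fetch a page that the updated rules disallow.
from urllib.robotparser import RobotFileParser

old_rules = ["User-agent: *", "Disallow: /private/"]
new_rules = ["User-agent: *", "Disallow: /private/", "Disallow: /new-page.html"]

def parser_for(lines):
    rp = RobotFileParser()
    rp.parse(lines)
    return rp

stale = parser_for(old_rules)
fresh = parser_for(new_rules)

# A robot with the stale copy thinks the new page is allowed:
print(stale.can_fetch("*", "https://example.com/new-page.html"))  # True
# A robot with the fresh copy correctly stays away:
print(fresh.can_fetch("*", "https://example.com/new-page.html"))  # False
```

Until every robot has refetched robots.txt, you are effectively in the "stale" case, which is why the wait before uploading matters.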
For several major 'bots, the minimum time can be controlled by setting a short Expires time on your robots.txt. I found this out the hard way when I set it to "Expires A1" by accident, and several robots responded by re-fetching robots.txt *every time* they requested a different page! :o
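For anyone unfamiliar with the mod_expires notation: "A<seconds>" means the expiry is computed from the access time, so "A1" marks the file stale one second after each fetch. A rough Python sketch of that arithmetic (the timestamp is arbitrary, just for illustration):

```python
# Sketch of mod_expires "A<seconds>" semantics: Expires = access time + N
# seconds. "A1" therefore tells a robot its copy is stale after 1 second,
# which explains the constant re-fetching described above.
from datetime import datetime, timedelta, timezone
from email.utils import format_datetime

def expires_header(access_time, directive):
    # e.g. "A1" -> 1 second after access; "A7200" -> 2 hours after access
    assert directive.startswith("A")
    seconds = int(directive[1:])
    return format_datetime(access_time + timedelta(seconds=seconds), usegmt=True)

access = datetime(2024, 1, 1, 12, 0, 0, tzinfo=timezone.utc)
print(expires_header(access, "A1"))     # Mon, 01 Jan 2024 12:00:01 GMT
print(expires_header(access, "A7200"))  # Mon, 01 Jan 2024 14:00:00 GMT
```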
Jim
I'm using Apache mod_headers and mod_expires:
# Set up Cache Control headers
ExpiresActive On
# Default - Set header to expire everything 1 week from last access, set must-revalidate
ExpiresDefault A604800
Header append Cache-Control "must-revalidate"
# Apply a customized expires header to frequently-updated files
<FilesMatch "^robots">
ExpiresDefault A7200
</FilesMatch>
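One thing worth knowing about that FilesMatch directive: the regex is tested against the file name, and an unanchored tail means "^robots" matches anything starting with "robots", not just robots.txt. A quick illustrative check (file names here are made up):

```python
# Sanity check of the FilesMatch regex "^robots": it matches any file
# whose name starts with "robots", e.g. a stray robots.txt.bak, not
# only robots.txt itself.
import re

pattern = re.compile(r"^robots")

for name in ["robots.txt", "robots.txt.bak", "index.html"]:
    print(name, bool(pattern.search(name)))
# robots.txt True
# robots.txt.bak True
# index.html False
```

If you only want robots.txt itself to get the short expiry, a tighter pattern like "^robots\.txt$" would do it.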
... (more)