All I can say is that, since Sunday:
138.23.89.xx - - [10/Nov/2003:21:48:17 -0800] "GET /robots.txt HTTP/1.0" 200 1524 "-" "Infomine Virtual Library Crawler/3.0 (see http://infomine.ucr.edu/projects/vl_crawler/)/1.0"
138.23.89.xx - - [10/Nov/2003:21:56:28 -0800] "GET / HTTP/1.0" 200 20399 "-" "Infomine Virtual Library Crawler/3.0 (see http://infomine.ucr.edu/projects/vl_crawler/)/1.0"
(The 'xx' values match those in the initial poster's log.)
...that has happened four times.
Nothing more, nothing less.
Pendanticist.
Whadoya think?
Server Response: h*tp://www.my-domain.com/robots.txt
Status: HTTP/1.1 200 OK
Date: Mon, 17 Nov 2003 01:55:31 GMT
Server: Apache/1.3.28 (Unix) FrontPage/5.0.2.2510 mod_ssl/2.8.15 OpenSSL/0.9.7a PHP-CGI/0.1b
Last-Modified: Fri, 14 Nov 2003 04:18:06 GMT
ETag: "e317af-2ad-3fb4577e"
Accept-Ranges: bytes
Content-Length: 685
Keep-Alive: timeout=2
Connection: Keep-Alive
Content-Type: text/plain
OK - should be interesting to see how long until it gets the message and gives up.
(7 403s so far)
138.23.89.56 - - [16/Nov/2003:17:28:54 -0800] "GET /robots.txt HTTP/1.0" 403 556 "-" "infomine.ucr.edu"
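For anyone wondering how to serve those 403s: a minimal .htaccess sketch, assuming Apache 1.3 with mod_setenvif and mod_access, and using the User-Agent string from the log above (adjust to your own setup):

# Flag requests whose User-Agent contains "infomine" (string taken from the log above)
SetEnvIfNoCase User-Agent "infomine" block_bot
# Or flag by source address instead:
# SetEnvIf Remote_Addr "^138\.23\.89\." block_bot
Order Allow,Deny
Allow from all
Deny from env=block_bot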
It appears that you have not specified an expiry time for the file, since there is no Expires header shown. As such, it is up to the client to determine when it will consider the file to be expired, and re-fetch it (or do a HEAD request or Conditional GET to see if it has changed).
In most cases, problems arise because the file is not re-checked often enough, but I suppose in this case it might be re-checked too often -- it depends entirely on how the 'bot was coded. Your headers show you're on Apache, so see the Apache mod_expires documentation for more info.
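A minimal sketch of what that might look like in .htaccess, assuming mod_expires is compiled in (the three-day value matches the Cache-Control: max-age=259200 shown in the headers below):

# Enable mod_expires and cache text/plain files (which covers robots.txt) for three days
ExpiresActive On
ExpiresByType text/plain "access plus 3 days"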
Jim
Server Response: h*tp://www.my-domain.com/robots.txt
Status: HTTP/1.1 200 OK
Date: Mon, 17 Nov 2003 02:55:26 GMT
Server: Apache/1.3.29 (Unix) FrontPage/5.0.2.2510 mod_ssl/2.8.16 OpenSSL/0.9.7c PHP-CGI/0.1b
Cache-Control: max-age=259200
Expires: Thu, 20 Nov 2003 02:55:26 GMT
Last-Modified: Fri, 14 Nov 2003 04:18:06 GMT
ETag: "e317af-2ad-3fb4577e"
Accept-Ranges: bytes
Content-Length: 685
Keep-Alive: timeout=2
Connection: Keep-Alive
Content-Type: text/plain
Some stuff to consider: If you change your robots.txt file often, use a shorter Expires time. Keep in mind that you'll essentially be telling robots not to re-check it for three days. Since it is at least somewhat feasible that an 'emergency' might arise with respect to robots.txt, I'd advise 12 hours maximum.
The expires time you specify also represents the amount of time you will need to allow when you put up a new page you don't want spidered; you'll have to update robots.txt to Disallow the page, upload it, wait 12 hours, and then upload the page itself. On stable developed sites, I use 6 to 12 hours for robots.txt, pages, and scripts, 30 days for images, and 0 seconds for custom error pages. You can use ExpiresByType or place various ExpiresDefault directives inside <Files> or <FilesMatch> containers to finely control expiry of specific files or groups of files (FilesMatch allows you to specify filenames using regular expressions).
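For illustration, that scheme might look something like this (a sketch only; the error-page filenames are hypothetical and would need to match your own ErrorDocument files):

ExpiresActive On
# Pages and scripts: 12 hours
ExpiresByType text/html "access plus 12 hours"
# Images: 30 days
ExpiresByType image/gif "access plus 30 days"
ExpiresByType image/jpeg "access plus 30 days"
# robots.txt specifically: 12 hours
<Files robots.txt>
  ExpiresDefault "access plus 12 hours"
</Files>
# Custom error pages: expire immediately (filenames here are hypothetical)
<FilesMatch "^(404|500)\.shtml$">
  ExpiresDefault "access plus 0 seconds"
</FilesMatch>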
A good rule of thumb is to set the expires time to one half of the usual update period.
Next project: See also mod_headers for including no-cache and must-revalidate cache-control headers. :)
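Something along these lines, for instance (untested; assumes mod_headers is compiled in, and the .cgi/.pl pattern is just an example):

# Tell caches never to reuse script output without revalidating
<FilesMatch "\.(cgi|pl)$">
  Header set Cache-Control "no-cache, must-revalidate"
</FilesMatch>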
Jim
However, adding mod_expires has not been successful in stopping this insatiable bot:
138.23.89.56 - - [17/Nov/2003:01:39:12 -0800] "GET /robots.txt HTTP/1.0" 200 685 "-" "infomine.ucr.edu"
138.23.89.56 - - [17/Nov/2003:01:41:57 -0800] "GET /robots.txt HTTP/1.0" 200 685 "-" "infomine.ucr.edu"
138.23.89.56 - - [17/Nov/2003:01:42:20 -0800] "GET /robots.txt HTTP/1.0" 200 685 "-" "infomine.ucr.edu"
138.23.89.56 - - [17/Nov/2003:01:42:45 -0800] "GET /robots.txt HTTP/1.0" 200 685 "-" "infomine.ucr.edu"
138.23.89.56 - - [17/Nov/2003:01:43:02 -0800] "GET /robots.txt HTTP/1.0" 200 685 "-" "infomine.ucr.edu"
138.23.89.56 - - [17/Nov/2003:01:43:21 -0800] "GET /robots.txt HTTP/1.0" 200 685 "-" "infomine.ucr.edu"
Their response was that, because their indexer is 'distributed', it has to fetch robots.txt every time it visits the site!?!
Received a speedy reply to my email (I used my ucsd.edu email addy) from a tech at infomine.ucr.edu. He basically said he would bring it to the attention of the programmers (when they returned from?) but asked if I had any suggestions. While it is indeed refreshing to receive such a cooperative attitude, I of course have no understanding of what needs to be changed (if anything).