redundant robots.txt requests?

keyplyr

6:44 pm on Nov 15, 2003 (gmt 0)

Is there a real need for a bot to request robots.txt 71 times in one visit?

138.23.89.56 - - [15/Nov/2003:06:45:05 -0800] "GET /robots.txt HTTP/1.0" 200 685 "-" "infomine.ucr.edu"

pendanticist

7:50 pm on Nov 15, 2003 (gmt 0)

I can't speak to 71 times, but I can say that since Sunday:

138.23.89.xx - - [10/Nov/2003:21:48:17 -0800] "GET /robots.txt HTTP/1.0" 200 1524 "-" "Infomine Virtual Library Crawler/3.0 (see http://infomine.ucr.edu/projects/vl_crawler/)/1.0"
138.23.89.xx - - [10/Nov/2003:21:56:28 -0800] "GET / HTTP/1.0" 200 20399 "-" "Infomine Virtual Library Crawler/3.0 (see http://infomine.ucr.edu/projects/vl_crawler/)/1.0"

<Where the 'xx' values equal those of the initial poster.>

...that has happened four times.

Nothing more, nothing less.

Pendanticist.

keyplyr

8:10 pm on Nov 15, 2003 (gmt 0)

pendanticist - that may be a different bot from UCR. Anyway, they've hit robots.txt another dozen times since I first posted. I sent the tech over there an email. After all, we work for the same company.

pendanticist

9:37 am on Nov 16, 2003 (gmt 0)

<Where the 'xx' values equal those of the initial poster.>

<shrug>

Pendanticist.

keyplyr

9:41 am on Nov 16, 2003 (gmt 0)

<shrug>

The IPs may match, but your UA is:


"Infomine Virtual Library Crawler/3.0 (see h*tp://infomine.ucr.edu/projects/vl_crawler/)/1.0"

Quite different from mine :)

keyplyr

12:34 am on Nov 17, 2003 (gmt 0)

Showed up again today and redundantly requested robots.txt 105 times - it just got itself banned until it behaves better.
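
For anyone wondering, a ban like that can be done with mod_access in .htaccess, along these lines (using the IP from the log above):

# Send a 403 to the misbehaving crawler; everyone else gets through.
Order allow,deny
Allow from all
Deny from 138.23.89.56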

jdMorgan

12:41 am on Nov 17, 2003 (gmt 0)

This is a long shot, but it's a good idea to check your robots.txt server response, and make sure you have a reasonable Expires header on it. Otherwise, the 'bot will see that your robots.txt is stale, and may re-fetch it for each spidering request. (Yeah, I messed this up myself one time...)

Jim

keyplyr

2:01 am on Nov 17, 2003 (gmt 0)

Thanks Jim, that's certainly something to consider even though no other bots behave this way AFAIK.

Whadoya think?


Server Response: h*tp://www.my-domain.com/robots.txt
Status: HTTP/1.1 200 OK
Date: Mon, 17 Nov 2003 01:55:31 GMT
Server: Apache/1.3.28 (Unix) FrontPage/5.0.2.2510 mod_ssl/2.8.15 OpenSSL/0.9.7a PHP-CGI/0.1b
Last-Modified: Fri, 14 Nov 2003 04:18:06 GMT
ETag: "e317af-2ad-3fb4577e"
Accept-Ranges: bytes
Content-Length: 685
Keep-Alive: timeout=2
Connection: Keep-Alive
Content-Type: text/plain

<added>
OK - should be interesting to see how long until it gets the message and gives up.
(7 403s so far)


138.23.89.56 - - [16/Nov/2003:17:28:54 -0800] "GET /robots.txt HTTP/1.0" 403 556 "-" "infomine.ucr.edu"

</added>

[edited by: keyplyr at 2:21 am (utc) on Nov. 17, 2003]

jdMorgan

2:21 am on Nov 17, 2003 (gmt 0)

keyplyr,

It appears that you have not specified an expiry time for the file, since there is no Expires header shown. As such, it is up to the client to determine when it will consider the file to be expired, and re-fetch it (or do a HEAD request or Conditional GET to see if it has changed).

In most cases, problems arise because the file is not re-checked often enough, but I suppose in this case it might be re-checked too often -- it depends entirely on how the 'bot was coded. Your headers show you're on Apache, so see the Apache mod_expires documentation for more info.

Jim

keyplyr

2:23 am on Nov 17, 2003 (gmt 0)

It appears that you have not specified an expiry time for the file

I thought that was done at the server level - didn't know I could do it.

<added>
Well I added this to .htaccess but I do not see a change in the header response:


ExpiresDefault "access plus 3 days"

</added>

keyplyr

2:58 am on Nov 17, 2003 (gmt 0)

OK - got it working. Had to turn it on first. So Jim, when you say "have a reasonable Expires header on it", what would you call "reasonable?" I set it to expire 3 days after the first encounter. Thanks.
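
For anyone following along, the working combination is these two lines in .htaccess (behavior per the mod_expires docs):

# Enable mod_expires processing, then mark files as
# expiring three days after each access.
ExpiresActive On
ExpiresDefault "access plus 3 days"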

Server Response: h*tp://www.my-domain.com/robots.txt
Status: HTTP/1.1 200 OK
Date: Mon, 17 Nov 2003 02:55:26 GMT
Server: Apache/1.3.29 (Unix) FrontPage/5.0.2.2510 mod_ssl/2.8.16 OpenSSL/0.9.7c PHP-CGI/0.1b
Cache-Control: max-age=259200
Expires: Thu, 20 Nov 2003 02:55:26 GMT
Last-Modified: Fri, 14 Nov 2003 04:18:06 GMT
ETag: "e317af-2ad-3fb4577e"
Accept-Ranges: bytes
Content-Length: 685
Keep-Alive: timeout=2
Connection: Keep-Alive
Content-Type: text/plain

jdMorgan

3:18 am on Nov 17, 2003 (gmt 0)

<cross-posted> You'll need to enable it with ExpiresActive On. </cross-posted>

Some stuff to consider: If you change your robots.txt file often, use a shorter Expires time. Keep in mind that you'll essentially be telling robots not to re-check it for three days. Since it is at least somewhat feasible that an 'emergency' might arise with respect to robots.txt, I'd advise 12 hours maximum.

The Expires time you specify also represents the amount of time you will need to allow when you put up a new page you don't want spidered: you'll have to update robots.txt to Disallow the page, upload that, wait 12 hours, and then upload the page itself.

On stable, developed sites, I use 6 to 12 hours for robots.txt, pages, and scripts, 30 days for images, and 0 seconds for custom error pages. You can use ExpiresByType, or place various ExpiresDefault directives inside <Files> or <FilesMatch> containers, to finely control the expiry of specific files or groups of files (FilesMatch lets you specify filenames using regular expressions).
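
For example, a sketch of that kind of setup in .htaccess (the times are the ones above; the error-page filenames are just placeholders):

# Enable mod_expires and set a conservative site-wide default.
ExpiresActive On
ExpiresDefault "access plus 12 hours"

# Images change rarely, so let caches keep them for 30 days.
ExpiresByType image/gif "access plus 30 days"
ExpiresByType image/jpeg "access plus 30 days"

# Custom error pages should expire immediately.
<FilesMatch "^(403|404)\.html$">
ExpiresDefault "access plus 0 seconds"
</FilesMatch>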

A good rule of thumb is to set the expires time to one half of the usual update period.

Next project: See also mod_headers for including no-cache and must-revalidate cache-control headers. :)
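
With mod_headers, that might look something like this (a sketch; the <Files> target is just an example):

# Tell caches to always revalidate this page before reusing it.
<Files "custom404.html">
Header set Cache-Control "no-cache, must-revalidate"
</Files>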

Jim

keyplyr

3:51 am on Nov 17, 2003 (gmt 0)

Geeeeeez, and I thought I'd figured it all out - LOL

Guess a 6-hour roll-over would be better, since I'm just declaring it for all files at the moment, until I understand it better. Thanks again.

keyplyr

8:03 am on Nov 17, 2003 (gmt 0)

Since the topic of this thread has evolved into Apache server issues, I am continuing that subject over here [webmasterworld.com].

However, adding the Expires header via mod_expires has not been successful in stopping this insatiable bot:


138.23.89.56 - - [17/Nov/2003:01:39:12 -0800] "GET /robots.txt HTTP/1.0" 200 685 "-" "infomine.ucr.edu"
138.23.89.56 - - [17/Nov/2003:01:41:57 -0800] "GET /robots.txt HTTP/1.0" 200 685 "-" "infomine.ucr.edu"
138.23.89.56 - - [17/Nov/2003:01:42:20 -0800] "GET /robots.txt HTTP/1.0" 200 685 "-" "infomine.ucr.edu"
138.23.89.56 - - [17/Nov/2003:01:42:45 -0800] "GET /robots.txt HTTP/1.0" 200 685 "-" "infomine.ucr.edu"
138.23.89.56 - - [17/Nov/2003:01:43:02 -0800] "GET /robots.txt HTTP/1.0" 200 685 "-" "infomine.ucr.edu"
138.23.89.56 - - [17/Nov/2003:01:43:21 -0800] "GET /robots.txt HTTP/1.0" 200 685 "-" "infomine.ucr.edu"

dcrombie

12:55 pm on Nov 17, 2003 (gmt 0)

I had a similar issue with Scooter last week. It fetched robots.txt over 100 times in one day (almost half its requests were for robots.txt).

Their response was that because their indexer is 'distributed', it has to fetch robots.txt every time it visits the site!?!

keyplyr

12:20 am on Nov 18, 2003 (gmt 0)

Received a speedy reply to my email (I used my ucsd.edu email addy) from a tech at infomine.ucr.edu. He basically said he would bring it to the attention of the programmers (when they returned from?) but asked if I had any suggestions. While it is indeed refreshing to encounter such a cooperative attitude, I of course have no idea what needs to be changed (if anything).

frankray

5:23 pm on Nov 21, 2003 (gmt 0)

The same thing is happening to our sites. Lots and lots of repeated and "unnecessary" calls to robots.txt.

Gonna ban them if they do not get on top of this...

keyplyr

7:16 pm on Nov 21, 2003 (gmt 0)

I've banned the IP for 2 days now but the hits continue. Still waiting for the follow-up from the tech dept over there.

keyplyr

8:39 pm on Dec 4, 2003 (gmt 0)

UPDATE: Looks like they have reformed their robot.

I telephoned over there and spoke to a woman in the science lab. After a two-week absence, the robot has returned with a new name, Infomine Virtual Library Crawler, and so far it has been behaving as all good little bots should.