Forum Moderators: phranque


Web page timeout for web bots

How long it takes before a web bot gives up on a page and what happens

         

andymerrett

12:58 pm on Mar 8, 2005 (gmt 0)

10+ Year Member



Hi,

I am interested to know how long an automated visitor - i.e. a web bot, spider, etc. - will try to access a web page before it gives up. Is this based on a server setting? (I recall reading something about the Apache web server's timeout value.) Would it get exactly the same response as a human visitor (i.e. if I visit a web page that doesn't load and times out after 30 seconds, would a bot get the same response)?

I know this is extremely generalised, given the plethora of different bots, some legitimate and some not, but I am interested in the technology - not because I want to develop one, but because I am combating 'bad' bots.

My dream is that a bad bot will crash itself if it tries to access a page that exists (200 code) but takes too long to load (say, 11 days? :) ). What *might* happen in this case?

Any ideas? Bot behaviour fascinates me, even when they are used unethically.

grandpa

1:06 pm on Mar 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Would it get exactly the same response as a human visitor (i.e. if I visit a web page that doesn't load and times out after 30 seconds, would a bot get the same response)?

I can answer this part. Yes. I've seen my index page served up in a search engine with the title "Internal Server Error". The URL reference was still valid, and the page would be served properly once the server error was remedied. Usually it is because of a timeout.

Not a pretty sight!

andymerrett

1:25 pm on Mar 8, 2005 (gmt 0)

10+ Year Member



So if the page was still valid (i.e. didn't come back with a 404 or a 500) but just took a long time to load, would the bot wait for the normal server timeout (I assume one is set), or might it have its own timeout value (i.e. if the page doesn't complete loading after 30 seconds, move on)?

I guess the bots are sophisticated enough not to be 'crashed' by this - if a bot is looking for HTML FORMs to fill in, it isn't going to be crashed by non-existent or server-error pages, if it even interprets the returned status code.

grandpa

2:04 pm on Mar 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I think you should look at the bot as being nothing more than a browser without the page rendering ability. Whatever happens in a browser will happen with a bot. The timeout is a function of the server.
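
On Apache, for instance, that server-side limit is the Timeout directive in httpd.conf - 300 seconds by default on the builds I've used, though your own config may differ:

    # httpd.conf -- how long Apache waits on certain network I/O
    # before giving up on the request (seconds)
    Timeout 300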

However, I would surmise that a bot can be programmed to retry the page more than once. I don't think a bot can be programmed to 'sit idle', waiting for a response for days at a time - but that's not really an area I've explored.

andymerrett

2:21 pm on Mar 8, 2005 (gmt 0)

10+ Year Member



The reason I ask is that I have a PHP page which generates a form bit by bit, with built-in delays (yes, it's a bad-bot trap; legitimate bots never see it).
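
Roughly speaking, the trap does something like this (a simplified sketch - the real file name, field names, count and delays are different):

    <?php
    // Bad-bot tarpit: drip the form out very slowly so an impatient
    // client either sits waiting or gives up.
    set_time_limit(0);           // don't let PHP kill the script itself
    ignore_user_abort(false);    // do stop when the client disconnects

    echo "<html><body><form method=\"post\" action=\"trap.php\">\n";
    flush();

    for ($i = 0; $i < 20; $i++) {
        echo "<input type=\"text\" name=\"field$i\">\n";
        flush();                 // push this fragment to the client...
        sleep(30);               // ...then make it wait for the next one
    }

    echo "</form></body></html>\n";
    ?>

Whether flush() really pushes each fragment straight out depends on PHP's output buffering and anything like mod_gzip sitting in front of it, so the pacing is only approximate.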

There would be no way for the bot to avoid this time delay, so would it just time out, either because it gives up or because the server boots it? And would the server boot a connection while it is still serving a page to the client?

DanA

2:56 pm on Mar 8, 2005 (gmt 0)

10+ Year Member



You could try with an offline browser (WinHTTrack, for example).
You would see that the number of active connections, the bandwidth usage, the timeout, the number of retries... can all be set.
Your form won't even slow down the spider.
Many bots only send a HEAD request and don't care about timeouts.
Some compare the answer with the info already stored (date and/or time and/or CRC) and then send a GET request if a programmed condition is met.
Some, such as Yahoo Slurp, will retry indefinitely...
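
For example, that check-before-fetch behaviour might look something like this in a bot built on PHP's cURL functions (just a sketch - the URL, the stored date and the 60-second timeout are all made up):

    <?php
    // Sketch of a crawler that HEADs a page first and only GETs the body
    // if the Last-Modified date differs from what it stored last time.
    $url        = 'http://www.example.com/page.html';   // made-up URL
    $storedDate = 'Tue, 01 Mar 2005 10:00:00 GMT';      // from a previous crawl

    $ch = curl_init($url);
    curl_setopt($ch, CURLOPT_NOBODY, true);         // HEAD request only
    curl_setopt($ch, CURLOPT_HEADER, true);         // keep the response headers
    curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
    curl_setopt($ch, CURLOPT_TIMEOUT, 60);          // the bot's own timeout, not the server's
    $headers = curl_exec($ch);
    curl_close($ch);

    // Fetch the full page only if the date has changed since the last crawl.
    if (preg_match('/^Last-Modified:\s*(.+)$/mi', $headers, $m)
            && trim($m[1]) != $storedDate) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_TIMEOUT, 60);
        $body = curl_exec($ch);
        curl_close($ch);
    }
    ?>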

andymerrett

3:36 pm on Mar 8, 2005 (gmt 0)

10+ Year Member



Thanks DanA,

The bots I am tarpitting will have to wait if they want the whole page - they suck in the whole page, process its FORM elements, and then try to submit to that form. Hence, if they see a web page starting with a FORM tag, they will wait and try to load the entire page. My PHP delay must therefore slow them down, or at the least they'll just give up after a predetermined amount of time.

Lord Majestic

3:42 pm on Mar 8, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Timeouts are bot-specific; the most primitive of them will use the standard library's timeout, which is likely to be anything from 120 to 300 seconds. Our bot uses a 60-second timeout, and if it detects a number of those "hard errors" then it will mark all further pages with the same error.

Adding artificial delays to your pages is not a good idea, because you will keep a web server thread or process waiting for the end of the transaction -- this is not an efficient use of the web server.

If you want to slow down legitimate bots efficiently, then support the Crawl-Delay parameter in robots.txt.
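
For example, a robots.txt entry along these lines asks compliant crawlers to leave a gap between requests (the ten seconds is only an example, and not every bot honours the parameter):

    User-agent: *
    Crawl-delay: 10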

andymerrett

4:34 pm on Mar 8, 2005 (gmt 0)

10+ Year Member



Nice thought - but it's the illegitimate ones that I want to eliminate. I'm aware of the poor use of server time, though from the limited amount I have read, the strain is not so much on the server or bandwidth as it is a DoS risk if too many connections are held open at once.

The site it is running on is only modestly visited by legitimate users/bots, with additional but not resource-threatening levels of traffic from illegitimate bots (thus far), so I am confident that I am not harming resources by implementing this method. robots.txt or other legitimate methods are never an option, because these guys don't play by those rules.

I wouldn't use this technique on a high-traffic site, as a DoS would be more likely.
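
If it ever did become a risk, I suppose the trap could simply refuse to hold more than a handful of connections open at once - something like this rough sketch, where the marker directory and the limit of five are made up:

    <?php
    // Rough sketch: bail out early if too many tarpit connections are
    // already open, so the trap can't be used to exhaust the server.
    $dir     = '/tmp/tarpit-slots';   // made-up marker directory
    $maxOpen = 5;                     // made-up connection cap

    @mkdir($dir);
    if (count(glob("$dir/slot*")) >= $maxOpen) {
        header('HTTP/1.0 503 Service Unavailable');
        exit;                         // refuse rather than tie up another thread
    }

    // Claim a slot, and release it even if the bot disconnects mid-trap.
    $slot = tempnam($dir, 'slot');
    register_shutdown_function('unlink', $slot);

    // ... the slow, drip-fed form from earlier goes here ...
    ?>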