I am interested to know how long an automated visitor - ie a web bot, spider etc - will try to access a web page before it gives up. Is this based on a server setting? (I recall reading something about the Apache web server's Timeout value.) Would it get exactly the same response as a human visitor (ie if I visit a web page that doesn't load and times out after 30 seconds, would a bot get the same response?)
I know this is extremely generalised, given the plethora of different bots, some legitimate and some not, but I am interested in the technology - not because I want to develop one, but because I am combating 'bad' bots.
My dream is that a bad bot will crash itself if it tries to access a page that exists (returns a 200 code) but takes too long to load (say, 11 days? :) What *might* happen in this case?
Any ideas? Bot behaviour fascinates me, even when bots are used unethically.
Would it get exactly the same response as a human visitor (ie if I visit a web page that doesn't load and times out after 30 seconds, would a bot get the same response?)
Not a pretty sight!
I guess the bots are sophisticated enough not to be 'crashed' by this - if a bot is looking for HTML FORMs to fill in, it isn't going to be crashed by non-existent pages or server error pages, if it even interprets the returned status code.
However, I would surmise that a bot can be programmed to retry accessing the page more than once. I don't think a bot would be programmed to 'sit idle' waiting for a response for days at a time - but that's not really an area I've explored.
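Most HTTP clients I've seen set their own connection and transfer timeouts and retry a couple of times before moving on - something along these lines in PHP with curl (just a sketch; the function name, retry count and timeout figures are made up for illustration, not any particular bot's code):

<?php
// Sketch of how a scraper might cap its own waiting time, regardless of the server.
function fetch_with_retries($url, $retries = 3, $timeout = 30) {
    for ($attempt = 1; $attempt <= $retries; $attempt++) {
        $ch = curl_init($url);
        curl_setopt($ch, CURLOPT_RETURNTRANSFER, true);
        curl_setopt($ch, CURLOPT_CONNECTTIMEOUT, 10); // give up connecting after 10 seconds
        curl_setopt($ch, CURLOPT_TIMEOUT, $timeout);  // give up on the whole transfer after $timeout seconds
        $body = curl_exec($ch);
        curl_close($ch);
        if ($body !== false) {
            return $body; // got the page
        }
        // failed or timed out - loop round and try again
    }
    return false; // gave up on this URL entirely
}

So a page that takes days to finish would most likely just be abandoned once the client-side timeout fires, rather than crashing anything.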
There would be no way for the bot to avoid this time delay, so it would just time out, either because it gives up or because the server boots it? But would the server boot a connection if it is still serving a page to it?
For the bots I am tarpitting, they will have to wait if they want the whole page - they suck in the whole page, process its FORM elements, and then try to submit to that form. Hence, if they see a web page starting with a FORM tag, they will wait and try to load the entire page. My PHP delay should therefore slow them down - or, at worst, they'll just give up after a predetermined amount of time.
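To be clear about what I mean by a PHP delay, it's something along these lines (a simplified sketch, not my exact code - the field names and timings are placeholders, and it assumes output buffering is off so the partial page actually reaches the bot):

<?php
// Drip-feed the page so a bot that wants the whole FORM has to keep its connection open.
set_time_limit(0);                 // don't let PHP's max_execution_time cut the page short
echo "<html><body><form method=\"post\" action=\"submit.php\">\n";
flush();                           // send the opening FORM tag straight away

for ($i = 0; $i < 60; $i++) {      // roughly a minute of dripping - tune to taste
    echo "<!-- please hold -->\n";
    flush();                       // push each fragment out (the server may still buffer it)
    sleep(1);
    if (connection_aborted()) {
        exit;                      // the bot gave up - free the worker immediately
    }
}

echo "<input type=\"text\" name=\"email\">\n</form></body></html>\n";

Whether the bot actually sees the drip depends on server-side buffering (output_buffering, mod_deflate and the like), so mileage may vary.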
Adding artificial delays to your pages is not a good idea, because you will keep a web server thread or process waiting for the end of the transaction -- this is not an efficient use of the web server.
If you want to slow down legitimate bots efficiently, support the Crawl-delay parameter in robots.txt.
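For example, something like this in robots.txt asks a compliant crawler to pause between fetches (the 10 seconds is just an illustrative figure, and not every crawler honours Crawl-delay):

User-agent: *
Crawl-delay: 10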
The site it is running on is only modestly visited by legitimate users/bots, with additional but not resource-threatening levels of traffic from illegitimate bots (thus far), so I am confident that I am not harming resources by implementing this method. robots.txt or other legitimate methods are never an option, because these guys don't play by those rules.
I wouldn't use this technique on a high-traffic site, as a DOS could be more likely.
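If I did want to run it somewhere busier, one option would be to cap how many requests are being tarpitted at any one time, so the delays can never tie up every server worker. A rough, untested sketch (the counter file path and the cap of 5 are arbitrary):

<?php
// Only tarpit a request if fewer than $max_tarpits are already being delayed.
$max_tarpits = 5;
$counter     = '/tmp/tarpit.count';

$fp = fopen($counter, 'c+');
flock($fp, LOCK_EX);
$busy = (int) stream_get_contents($fp);

if ($busy >= $max_tarpits) {
    flock($fp, LOCK_UN);
    fclose($fp);
    header('HTTP/1.1 403 Forbidden'); // too many connections already held - fail fast instead
    exit;
}

ftruncate($fp, 0);
rewind($fp);
fwrite($fp, (string) ($busy + 1));
flock($fp, LOCK_UN);
fclose($fp);

// Decrement the counter when this request finishes, however it finishes.
register_shutdown_function(function () use ($counter) {
    $fp = fopen($counter, 'c+');
    flock($fp, LOCK_EX);
    $busy = max(0, (int) stream_get_contents($fp) - 1);
    ftruncate($fp, 0);
    rewind($fp);
    fwrite($fp, (string) $busy);
    flock($fp, LOCK_UN);
    fclose($fp);
});

// ...the slow, drip-fed page would follow here...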