I’ve read the Wheels thread and spent a night reading incredibill’s website. [ webmasterworld.com...]
Then I started programming, but found out it was more difficult than I thought. I’m logging everyone who accesses robots.txt, and I set up some hidden links (not yet blocked in robots.txt) to see who is crawling my website. After a couple of hours I already had more than 50 bots, some of which I don’t know for sure are really bots (though I can’t think of a way to reach those hidden links unless you are a bot?).
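For anyone wondering what I mean by a hidden-link trap, this is roughly the idea. A minimal Python sketch, where the trap path, port and log file name are just placeholders for whatever your setup uses:

# Minimal honeypot sketch: serve a hidden URL and log whoever fetches it.
# The trap path, port and log file name are placeholders.
import logging
from http.server import BaseHTTPRequestHandler, HTTPServer

logging.basicConfig(filename="bot_trap.log", level=logging.INFO,
                    format="%(asctime)s %(message)s")

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        # The trap page is linked invisibly from normal pages and is
        # (deliberately) not yet disallowed in robots.txt.
        if self.path == "/hidden-trap.html":
            ua = self.headers.get("User-Agent", "-")
            logging.info("trap hit: ip=%s ua=%s", self.client_address[0], ua)
        self.send_response(200)
        self.send_header("Content-Type", "text/html")
        self.end_headers()
        self.wfile.write(b"<html><body>Nothing to see here.</body></html>")

if __name__ == "__main__":
    HTTPServer(("", 8000), TrapHandler).serve_forever()

Anything that shows up in that log followed the invisible link, which a normal visitor with a browser shouldn’t do.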
Incredibill recommended using a whitelist to give bots access, but how would you implement that? Do you give access only to the bots on your whitelist and block every other bot by default?
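Is the idea something like the sketch below? This is just my guess at what a whitelist check could look like; the user-agent tokens and domains are examples only, and the reverse-then-forward DNS double-check is the verification method the big search engines document for their own crawlers:

# Rough whitelist sketch: allow only known crawlers, verified by DNS,
# so everything else claiming to be a bot is blocked by default.
# The tokens and domains below are examples, not a complete list.
import socket

WHITELIST = {
    "Googlebot": (".googlebot.com", ".google.com"),
    "bingbot": (".search.msn.com",),
}

def is_whitelisted_bot(ip, user_agent):
    for token, domains in WHITELIST.items():
        if token in user_agent:
            try:
                host = socket.gethostbyaddr(ip)[0]      # reverse DNS lookup
            except socket.herror:
                return False                            # no PTR record: reject
            if not host.endswith(domains):
                return False                            # wrong domain: spoofed UA
            try:
                # Forward-confirm: the hostname must resolve back to the IP.
                return ip in socket.gethostbyname_ex(host)[2]
            except socket.gaierror:
                return False
    return False  # user agent not on the whitelist

With something like this you only ever store a handful of names and domains instead of an ever-growing list of bad IPs.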
And how would you block the unwanted bots without using too many resources? If I put all the IPs of unwanted bots in a database, the list would grow massively at this pace. The same goes if I add them to .htaccess.
The only way to do it effectively would be to detect “bot behaviour” and block based on that. But I’ve searched the internet for the “best way” and couldn’t find anything satisfying to start with.
I would like to know an efficient way to detect how fast a crawler is accessing my website. Do you do this with a database, with sessions, or…? I did find some code snippets in other threads, but none of them worked “out of the box”, so I couldn’t test them and adapt them for my website. ( [ webmasterworld.com...] )
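To make the question concrete, this is the kind of thing I’m after. A minimal in-memory sketch, where the window length and threshold are arbitrary numbers I made up, and the per-process dict would have to become a database table or shared cache in a real setup:

# Minimal rate-detection sketch: count requests per IP in a sliding window.
# WINDOW_SECONDS and MAX_REQUESTS are arbitrary; the in-memory dict resets
# on restart and is not shared between server processes.
import time
from collections import defaultdict, deque

WINDOW_SECONDS = 10
MAX_REQUESTS = 20          # more hits than this per window looks like a crawler

hits = defaultdict(deque)  # ip -> timestamps of recent requests

def looks_like_fast_crawler(ip):
    now = time.time()
    q = hits[ip]
    q.append(now)
    while q and q[0] < now - WINDOW_SECONDS:
        q.popleft()        # drop hits that fell out of the window
    return len(q) > MAX_REQUESTS

Calling looks_like_fast_crawler(ip) on every request would then tell you when a single IP crosses the threshold, and that’s the point where you could start blocking.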
Can anyone help?