My question is how to stop them. Frankly speaking, I am interested in allowing only Googlebot, Yahoo, and Ask Jeeves to crawl my site; the rest can go to hell. How do I set it up so that only Googlebot, Yahoo, and Ask Jeeves can crawl the website, and so that an IP is blocked if a large number of requests come from it? In my case I don't expect a surfer to visit more than 3 pages. So if there are, say, 20+ requests in a minute (or some better algorithm flags it), the IP should be blocked.
I also understand that many downloaders may masquerade as Googlebot and pass through. So is there a good way to guard against this?
How can this be achieved?
You also forgot about all the scraper & exploit bots that will claim to be common web browsers. There are also a few "broken protocol" proxy and caching services/servers.
You want to accomplish a task that has no easy answer. There are a lot of trash bots on the Internet scraping content, harvesting email, scanning for exploits and driving inflated traffic reports. You can make a few server adjustments and set aside time to deal with the most offensive sources.
Look at what Brett experienced when he wanted to vent his wrath on the misbehaving bots that were slamming WebmasterWorld. You have to see him tell the story. The exasperation mixed with extreme annoyance slowly creeps into his face and voice as he describes the options they tried and the aggravating results.
You could force cookies, but a lot of anti-spyware & anti-phishing software will automatically delete or refuse to accept a cookie.
You could use Flash and force users to load the plugin and navigate through Flash menus. But then you have to hope that visitors have Flash installed and that your Flash programming will be compatible with the largest variety of versions. You also have to create your Flash objects so Google will be able to "read" the text and navigate the site. A pure Flash site is evidence of pending failure for search results.
You could also set up a small sandbox script or hidden DIV that identifies rogue bots and redirects them or forces a 403. I've seen samples of scripts that capture the IP address of rogue bots and then build a DENY list. The problem is that you will develop an astounding list of subnets in .htaccess that will eventually affect server performance.
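To make the trap idea concrete, here is a rough Python sketch, not a drop-in script. The `/trap/` path, the `TrapList` class and its `handle` method are all made-up names for illustration; it assumes the trap URL is linked only from a hidden DIV that a human would never follow. Any IP that requests it goes on a deny list and gets a 403 from then on.

```python
import ipaddress

# Hypothetical trap URL: assumed to be linked only from a hidden DIV,
# so a human visitor should never request it.
TRAP_PATH = "/trap/"


class TrapList:
    """Keeps an in-memory deny list of IPs that walked into the trap."""

    def __init__(self):
        self.denied = set()

    def handle(self, ip, path):
        """Return the HTTP status code to send for this request."""
        addr = ipaddress.ip_address(ip)  # raises ValueError on a malformed IP
        if addr in self.denied:
            return 403
        if path.startswith(TRAP_PATH):
            # First visit to the trap: remember the IP and refuse it.
            self.denied.add(addr)
            return 403
        return 200
```

Keeping the list in a set inside the application, rather than as a growing block of DENY lines in .htaccess, is one way to sidestep the performance problem mentioned above, since the lookup cost does not grow with the size of the list. Whether that fits your server setup is a separate question.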
One thing I implemented is flood control. If I get more hits from one URL than is sensible for a human, the control gets triggered.
More than is sensible is a varying value depending on how crucial / resource heavy the page is, and it includes varying trigger levels. One example might be:
triggerif: > 20 hits in 1 minute; or > 25 in 2 minutes; or > 500 in 1 hour.
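A rule like that one could be checked along these lines. This is only a sketch in Python, assuming hits are tracked per IP address and kept in memory; `FloodTrigger`, `hit` and `LIMITS` are names invented here, not taken from any real implementation.

```python
from collections import defaultdict, deque

# (max hits, window in seconds), copied from the example rule:
# > 20 in 1 minute, > 25 in 2 minutes, > 500 in 1 hour.
LIMITS = [(20, 60), (25, 120), (500, 3600)]


class FloodTrigger:
    """Tracks recent hit times per IP and checks them against every limit."""

    def __init__(self, limits=LIMITS):
        self.limits = limits
        self.longest = max(window for _, window in limits)
        self.hits = defaultdict(deque)  # ip -> timestamps, oldest first

    def hit(self, ip, now):
        """Record a hit at time `now` (seconds); return True if any limit is exceeded."""
        times = self.hits[ip]
        times.append(now)
        # Forget hits older than the longest window; they can no longer matter.
        while times and times[0] <= now - self.longest:
            times.popleft()
        for max_hits, window in self.limits:
            recent = sum(1 for t in times if t > now - window)
            if recent > max_hits:
                return True
        return False
```

With these limits, twenty hits inside a minute pass and the twenty-first trips the control. In practice you would call `hit(ip, time.time())` for each request and use different `limits` for heavier pages.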
When a control is triggered, it does various things, usually at random:
-- redirect user to a spam page
-- redirect user to 127.0.0.1
-- issue a 500 reply
-- issue no reply at all
-- return a page that says something like: "this page was generated by a search engine run by a spammer"
-- add the IP address to a permanent ban list.
The control remains triggered for a period (again, it varies). Typically, it is 30 minutes after the hit rate drops below the triggering threshold.
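The "stays triggered for a while" part, together with the random choice of response, might look roughly like this in Python. It is a sketch under assumptions: the action names are just labels for the responses listed above, `FloodControl` and `respond` are invented names, and the caller is assumed to decide separately whether the IP is currently over the threshold.

```python
import random

HOLD_SECONDS = 30 * 60  # the typical 30-minute hold described above

# Labels for the responses listed above; what each one does is up to the caller.
ACTIONS = [
    "redirect_spam_page",
    "redirect_loopback",
    "status_500",
    "no_reply",
    "spammer_notice",
    "permanent_ban",
]


class FloodControl:
    """Keeps a control active until `hold` seconds after the flooding stops."""

    def __init__(self, hold=HOLD_SECONDS, rng=None):
        self.hold = hold
        self.rng = rng or random.Random()
        self.release_at = {}  # ip -> time at which the control lapses

    def respond(self, ip, over_threshold, now):
        """Return a random action while the control is active, else None."""
        if over_threshold:
            # Still flooding: keep pushing the release time forward.
            self.release_at[ip] = now + self.hold
        elif self.release_at.get(ip, float("-inf")) <= now:
            # Never triggered, or the hold period has run out.
            self.release_at.pop(ip, None)
            return None
        return self.rng.choice(ACTIONS)
```

Because the release time is reset on every over-threshold hit, the 30 minutes only start counting once the hit rate has dropped back below the trigger level.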
This deals with 95% of all bandwidth drains on some very busy sites.