Block all bots, except Googlebot and a few others

Forum Moderators: phranque

by using smart algorithm...

Imaster

7:05 am on May 31, 2007 (gmt 0)

I manage a large website that is often hit by spammy bots and offline downloaders trying to download the whole site. This results in a lot of bandwidth consumption and extra load, because these bots don't obey any rules.

My question is how to stop them. Frankly speaking, I am interested in allowing only Googlebot, Yahoo, and Ask Jeeves to crawl my site, and the rest can go to hell. How do I set it up so that only Googlebot, Yahoo, and Ask Jeeves can crawl the website, and so that an IP is blocked if it sends a large number of requests or something like that? In my case I don't expect a surfer to visit more than 3 pages. So if there are, say, 20+ requests in a minute (or some better algorithm decides so), the IP should be blocked.

I also understand that many downloaders may masquerade as Googlebot and pass through. So is there a good option to counter this?

How can this be achieved?

TXGodzilla

7:43 am on May 31, 2007 (gmt 0)

What? You don't want MSNbot to crawl the site? ;-)

You also forgot about all the scraper & exploit bots that will claim to be common web browsers. There are also a few "broken protocol" proxy and caching services/servers.

You want to accomplish a task that has no easy answer. There are a lot of trash bots on the Internet scraping content, harvesting email, scanning for exploits and driving inflated traffic reports. You make a few server adjustments and set aside time to deal with the most offensive sources.

Look at what Brett experienced when he wanted to vent his wrath on the misbehaving bots that were slamming WebmasterWorld. You have to see him tell the story. The exasperation mixed with extreme annoyance slowly creeps into his face and voice as he describes the options they tried and the aggravating results.

You could force cookies, but a lot of anti-spyware & anti-phishing software will automatically delete or refuse to accept a cookie.

You could use JavaScript, but there are bots out there that can navigate JavaScript. You also have script blockers on the more paranoid computers.

You could use Flash and force users to load the plugin and navigate through Flash menus. But then you have to hope that visitors have Flash installed and that your Flash programming will be compatible with the largest variety of versions. You also have to create your Flash objects so Google will be able to "read" the text and navigate the site. A pure Flash site is evidence of pending failure for search results.

You could also set up a small sandbox script or hidden DIV that identifies rogue bots and redirects them or forces a 403. I've seen samples of scripts that capture the IP addresses of rogue bots and then build a DENY list. The problem is that you will develop an astounding list of subnets in .htaccess that will eventually affect server performance.
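
As a rough illustration of the deny-list idea, here is a minimal Python sketch that turns a set of trapped IPs into Apache 2.2-style "Deny from" lines and merges adjacent addresses into CIDR blocks to keep the list shorter. The function name build_deny_lines and the sample addresses are made up for this example, it assumes IPv4 only, and it does nothing about how the trap itself collects the addresses.

```python
import ipaddress


def build_deny_lines(trapped_ips):
    """Turn trapped IPv4 addresses into "Deny from" lines (hypothetical helper).

    Adjacent addresses that exactly fill a CIDR block are merged, which
    shortens the list without blocking any address that was not trapped.
    """
    # Each plain address becomes a /32 network; the set drops duplicates.
    networks = [ipaddress.IPv4Network(ip) for ip in set(trapped_ips)]
    lines = []
    # collapse_addresses yields the merged networks in sorted order.
    for net in ipaddress.collapse_addresses(networks):
        # Write single hosts as bare addresses, larger blocks in CIDR form.
        target = str(net.network_address) if net.prefixlen == 32 else str(net)
        lines.append(f"Deny from {target}")
    return lines


if __name__ == "__main__":
    sample = ["192.0.2.4", "192.0.2.5", "192.0.2.6", "192.0.2.7", "198.51.100.9"]
    print("\n".join(build_deny_lines(sample)))
```

Merging only helps when the rogue addresses happen to sit next to each other, so this reduces the performance problem rather than removing it.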

victor

9:31 am on May 31, 2007 (gmt 0)

As TXG suggests, there is no one silver bullet. Just a lot of things you could try.

One thing I implemented is flood control. If I get more hits from one URL than is sensible for a human, the control gets triggered.

More than is sensible is a varying value depending on how crucial / resource heavy the page is, and it includes varying trigger levels. One example might be:

triggerif: > 20 hits in 1 minute; or > 25 in 2 minutes; or > 500 in 1 hour.
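
A trigger rule like the one above could be checked roughly as follows. This is only a sketch of one possible approach, not victor's actual implementation: the class name FloodDetector, the per-client key, and the choice to keep raw timestamps in memory are all assumptions made for the example.

```python
from collections import defaultdict, deque

# (max_hits, window_seconds) pairs taken from the example rule above.
THRESHOLDS = [(20, 60), (25, 120), (500, 3600)]


class FloodDetector:
    """Keeps recent hit timestamps per client and checks them against thresholds."""

    def __init__(self, thresholds=THRESHOLDS):
        self.thresholds = thresholds
        self.longest = max(window for _, window in thresholds)
        self.hits = defaultdict(deque)  # client key -> timestamps, oldest first

    def record(self, client, now):
        """Record one hit at time `now` (seconds); return True if any rule trips."""
        stamps = self.hits[client]
        stamps.append(now)
        # Forget hits that are older than the longest window we care about.
        while stamps and stamps[0] <= now - self.longest:
            stamps.popleft()
        for max_hits, window in self.thresholds:
            recent = sum(1 for t in stamps if t > now - window)
            if recent > max_hits:
                return True
        return False


if __name__ == "__main__":
    detector = FloodDetector()
    # One hit per second from the same (made-up) address.
    tripped = [detector.record("203.0.113.7", second) for second in range(21)]
    print(tripped.index(True))  # index of the first hit that trips a rule
```

Different pages could each get their own threshold list, which matches the idea of varying the trigger by how resource heavy the page is.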

When a control is triggered, it does various things, usually at random:
-- redirect user to a spam page
-- redirect user to 127.0.0.1
-- issue a 500 reply
-- issue no reply at all
-- return a page that says something like: "this page was generated by a search engine run by a spammer"
-- add the IP address to a permanent ban list.

The control remains triggered for a period (again, it varies). Typically, it is 30 minutes after the hit rate drops below the triggering threshold.
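
The "stays triggered until the rate has been quiet for a while" behaviour, combined with a randomly chosen response, might look something like this sketch. The action labels, the TriggerState class, and the over_threshold flag (assumed to come from whatever rate check is in use) are hypothetical names for illustration only.

```python
import random

COOLDOWN_SECONDS = 30 * 60  # "typically 30 minutes", per the post

# Made-up labels for the responses listed above.
ACTIONS = [
    "redirect_spam_page",
    "redirect_localhost",
    "http_500",
    "no_reply",
    "spammer_notice",
    "permanent_ban",
]


class TriggerState:
    """Remembers when each client was last over threshold."""

    def __init__(self, cooldown=COOLDOWN_SECONDS, rng=None):
        self.cooldown = cooldown
        self.rng = rng or random.Random()
        self.last_over = {}  # client -> last time its hit rate was over threshold

    def decide(self, client, now, over_threshold):
        """Return an action label while the control is triggered, else None."""
        if over_threshold:
            self.last_over[client] = now
        last = self.last_over.get(client)
        if last is None:
            return None  # never triggered: serve the page normally
        if now - last >= self.cooldown:
            del self.last_over[client]  # quiet for long enough: release
            return None
        return self.rng.choice(ACTIONS)


if __name__ == "__main__":
    state = TriggerState()
    print(state.decide("203.0.113.7", now=0, over_threshold=True))
    print(state.decide("203.0.113.7", now=1800, over_threshold=False))
```

Because the timer restarts on every over-threshold hit, a bot that keeps hammering never gets released, while a client that backs off is let in again after the cooldown.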

This deals with 95% of all bandwidth drains on some very busy sites.