John Young, the administrator of Cryptome, has a constant problem with rude robots; he has blocked as many as 100 at a time. But the real problem is that all of his files are static, so he can only block offenders manually. Without on-the-fly, automated blocking, his server gets overloaded before he can take action. If he were to redesign his site for on-the-fly blocking, how would he go about it? Moreover, he runs the site as a hobby, and he's an architect, not a programmer. He shouldn't have to worry about rude robots at all.
Most of the participants in this forum work for commercial firms that are trying to generate MORE search engine activity. But if you are designing a site for a nonprofit that, for example, wants to digitize and put online 25 years' worth of dead-tree articles from its specialized publication, you can pretty well plan on having rude robot problems. An archive like that is something to "kill for" if you share the specialized interest; there's no question that many rude surfers will find the tool they need to "click for" it as well, and will just plow through the entire site on their broadband connections.
The entire site has to be designed from scratch to provide maximum flexibility, and this generally means that all the pages have to be generated dynamically.
You need a compiled C program for speed, memory efficiency, and the lowest CPU load. To block effectively, you have to be faster than the fastest rude robot while staying transparent to the bots you like, and you may be dealing with a mixture of both simultaneously. This is no place for server-side Java or Perl scripts.
Beyond that, I'd be interested in hearing from anyone who has given this problem further thought. It seems to me that a specialized Apache module or entire httpd package that is optimized for this is something that might already have a market. And this market will get bigger, because the rude robot business is getting bigger. The robots.txt standard presumes unflinching courtesy from everyone. Those days are long gone.