I've looked at the WebmasterWorld robots.txt [webmasterworld.com] for inspiration and guidance, and I'm puzzled by some of the exclusions, such as WebmasterWorld Extractor.
Also, this file appears to differ from robots4.txt [searchengineworld.com], the comprehensive robots.txt file on the tutorial page that only allows known "nice guy" spiders. That one features some different agents, like BlackWidow.
Any suggestions, please, for a list of "must exclude" agents for a robots.txt file?
Generally speaking, very few robots need to be Disallowed in your robots.txt, for the simple reason that most of the bad ones won't read or honor robots.txt.
As a result, different site owners will use differing criteria to determine which robots to exclude. But even the worst of the ones that do honor robots.txt are just a nuisance. They spider your site with a specific purpose - as stated by the owners - that has nothing to do with your site. So they're just wasting their time and your bandwidth.
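To make the point about "honoring" robots.txt concrete, here is a minimal sketch of how a well-behaved robot would read a Disallow record, using Python's standard urllib.robotparser. The robots.txt content, the example.com URLs, the ExampleBot name, and the is_allowed helper are all made up for illustration; only BlackWidow comes from the files mentioned above. A bad bot simply never runs a check like this.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: shut out one named agent entirely,
# and keep everyone else out of /cgi-bin/ only.
ROBOTS_TXT = """\
User-agent: BlackWidow
Disallow: /

User-agent: *
Disallow: /cgi-bin/
"""


def is_allowed(user_agent: str, url: str, robots_txt: str = ROBOTS_TXT) -> bool:
    """Return True if a compliant robot with this user-agent may fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for agent in ("BlackWidow", "ExampleBot"):  # ExampleBot is a made-up name
        for path in ("/index.html", "/cgi-bin/form.pl"):
            url = "http://example.com" + path
            print(agent, path, is_allowed(agent, url))
```

The check is purely advisory: it tells you what a polite spider would do, not what an impolite one will do.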
To exclude the truly bad ones, you'll need to block them at the server by user-agent or by IP address range. There is much ongoing discussion of this in the Tracking and Logging and Search Engine Spider Identification forums. If you want to keep bad bots off your site completely, you'll be disappointed - the best you can really do is block the known ones and try to trap the unknown ones. But it's never 100%.
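As a rough illustration of blocking by user-agent or by IP address range, here is a minimal Python sketch of the decision itself. It is not tied to any particular server; in practice the same test usually lives in the server configuration. The should_block function and both lists are assumptions for the example: BlackWidow is taken from the files discussed above, and 192.0.2.0/24 is a reserved documentation range standing in for a real offender's network.

```python
import ipaddress

# Assumed example data, not a recommended blocklist.
BLOCKED_AGENT_SUBSTRINGS = ("blackwidow",)  # matched case-insensitively
BLOCKED_NETWORKS = (ipaddress.ip_network("192.0.2.0/24"),)  # placeholder range


def should_block(user_agent: str, remote_addr: str) -> bool:
    """Return True if the request matches a blocked agent or address range."""
    ua = user_agent.lower()
    if any(fragment in ua for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return True
    addr = ipaddress.ip_address(remote_addr)
    return any(addr in network for network in BLOCKED_NETWORKS)


if __name__ == "__main__":
    print(should_block("BlackWidow/1.0", "198.51.100.7"))  # blocked by agent
    print(should_block("Mozilla/5.0", "192.0.2.55"))       # blocked by range
    print(should_block("Mozilla/5.0", "198.51.100.7"))     # allowed through
```

The user-agent test only catches bots that announce themselves honestly, which is why the IP range check is there as a second line of defence. Neither one catches a bot you have not seen before.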
Some of the following bots will obey robots.txt. However, there may also be copy-cats that use the same user-agent and ignore these disallows; they need to be blocked with other methods.
Here's my current list, just for starters: