It's a much shorter list, because most bad spiders ignore robots.txt. Like the user-agents listed in the almost-perfect ban list, different webmasters may disagree with some entries, and others may want to add more.
Some entries are there only because I'm waiting to see if they will obey robots.txt, e.g. dumbdot. Others are either just an annoyance, or use my server bandwidth to make a profit and are not welcome. Again, my list is short because these are the only ones that obey (or may obey) robots.txt that I have seen.
In case anyone from one of the issuing organizations drops by... I insist that any robot which wishes to spider my site meet the following requirements:
1. Strict conformance to robots.txt directives.
2. A proper user-agent string identifying the using organization, preferably with an info-page URL, but an e-mail address is OK.
3. If a public-domain robot, a proper and enforceable licensing agreement with all third-party users of the robot supporting the above requirements.
4. A demonstrable benefit for my visitors, my potential visitors, my site, or public-domain Web resources.
I have encouraged our friends at Nutch to especially heed item #3; Grub.org failed to do this, and is now unwelcome as a result.
No, I don't have delusions of grandeur, and my sites are not important, but the list above is just "proper netiquette" for robots. I hope my few 403-Forbidden responses will be noticed, and my custom 403 Error page might be read, but I doubt it.
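For anyone writing a robot, the robots.txt conformance requirement can be sketched with Python's standard urllib.robotparser module. This is only a rough illustration: the robot name ExampleBot, its info-page URL, and the sample rules are made up, and a real crawler would download the site's actual robots.txt rather than use a hard-coded string.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, for illustration only.
SAMPLE_ROBOTS_TXT = """\
User-agent: ExampleBot
Disallow: /private/

User-agent: *
Disallow: /
"""

# Hypothetical user-agent string that names the robot and gives an info page.
USER_AGENT = "ExampleBot/1.0 (+http://www.example.com/bot.html)"


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True only if the given robots.txt text permits this fetch."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    for path in ("/index.html", "/private/page.html"):
        url = "http://www.example.com" + path
        print(path, may_fetch(SAMPLE_ROBOTS_TXT, USER_AGENT, url))
```

A well-behaved robot would run a check like this before every request and simply skip any URL for which it returns False.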
i agree completely with you on this... this is one of the
main reasons why i have spent so much time, over the years,
accumulating user agents and such and applying them to my
banned listings... i've been doing this via my apache configs
but i see now that it appears to be more beneficial to
utilize .htaccess for this purpose, if none other... at the
least, it keeps me from having to restart the server each
time i make an update...
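One way to keep such a ban list maintainable is to generate the .htaccess rules from a plain list of user-agent patterns. The sketch below is only an assumption about how that might look: the agent names and the bad_bot variable name are invented, the patterns are assumed to be regular expressions containing no double quotes, and the output uses the older Apache 2.2-style Order/Allow/Deny directives together with mod_setenvif.

```python
def build_htaccess(banned_agents):
    """Build .htaccess text that denies requests from the listed user-agents."""
    lines = ["# Generated ban list (Apache 2.2-style access control)"]
    for agent in banned_agents:
        # Flag any request whose User-Agent header matches the pattern.
        lines.append(f'SetEnvIfNoCase User-Agent "{agent}" bad_bot')
    # Allow everyone except requests flagged above; those receive 403 Forbidden.
    lines += ["Order Allow,Deny", "Allow from all", "Deny from env=bad_bot"]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical agent patterns, for illustration only.
    print(build_htaccess(["EvilScraper", "^SpamHarvester"]))
```

Because Apache re-reads .htaccess on each request, a regenerated file takes effect without a server restart.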
If I may, since you're on the subject of robots.txt, I'd like to ask a question. I'm in need of some answers.
If a site has a robots.txt that disallows certain pages from being indexed, does the spider still take the URL of those pages and index it without any meta information?
thanks in advance.
Google and AJ/Teoma do that: they list any URL they find anywhere. They are still in compliance with robots.txt, though, in that they do not fetch and index the page; they just list the link. Other SEs may do this too, but I'm not aware of any other major U.S. engines that behave this way.
In order to prevent the link from being listed, you have to allow Google and AJ/Teoma to fetch the page, but include the <meta name="robots" content="noindex"> meta-tag on the page itself. Inefficient, but it works.
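The logic above can be modelled in a few lines of Python. This is a simplified sketch of the behaviour as described here, not how any search engine actually implements it; the class and function names are invented for illustration, and it uses only the standard html.parser module.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collects directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr_map = {name.lower(): (value or "") for name, value in attrs}
        if attr_map.get("name", "").lower() == "robots":
            for item in attr_map.get("content", "").split(","):
                self.directives.add(item.strip().lower())


def has_noindex(html: str) -> bool:
    """Return True if the page carries a robots meta tag with noindex."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return "noindex" in parser.directives


def may_list_url(fetch_allowed: bool, html: str = "") -> bool:
    """Hypothetical model: may a link-listing engine still show this URL?"""
    if not fetch_allowed:
        # The page is never fetched, so a noindex tag on it is never seen
        # and the bare URL can still be listed.
        return True
    return not has_noindex(html)
```

Under this model, a page blocked in robots.txt gives may_list_url(False) == True, while a fetchable page carrying the noindex tag is the only case that keeps the URL out of the listing.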
Thanks for the information. This has answered my question.