Forum Moderators: goodroi
How about a disallow spider list?
It's a much shorter list, because most bad spiders ignore robots.txt. As with the user-agents in the almost-perfect ban list, some webmasters may disagree with certain entries, and others may want to add more.
Here's mine:
User-agent: almaden
User-agent: ASPSeek
User-agent: baiduspider
User-agent: dumbBot
User-agent: Generic
User-agent: grub-client
User-agent: MSIECrawler
User-agent: NexaBot
User-agent: NPBot
User-agent: OWR_Crawler
User-agent: psbot
User-agent: rabaz
User-agent: RPT-HTTPClient
User-agent: ScoutAbout
User-agent: semanticdiscovery
User-agent: TurnitinBot
User-agent: Wget
Disallow: /
Some entries are there only because I'm waiting to see whether they will obey robots.txt, e.g. dumbBot. Others are either just an annoyance, or use my server bandwidth to make a profit and are not welcome. Again, my list is short because these are the only ones I have seen that obey (or may obey) robots.txt.
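To see how a well-behaved spider is supposed to interpret a list like the one above, here's a minimal sketch using Python's standard-library robots.txt parser. The rules and bot names are taken from the list in this post; the page path is just an example.

```python
# Sketch: how a polite crawler checks robots.txt before fetching a page.
# Multiple User-agent lines share the Disallow rule that follows them,
# exactly as in the list posted above.
from urllib.robotparser import RobotFileParser

rules = """\
User-agent: dumbBot
User-agent: Wget
Disallow: /
"""

parser = RobotFileParser()
parser.parse(rules.splitlines())

# Named bots are shut out of the whole site; unlisted bots are unaffected.
print(parser.can_fetch("dumbBot", "/page.html"))    # False
print(parser.can_fetch("Wget", "/page.html"))       # False
print(parser.can_fetch("Googlebot", "/page.html"))  # True
```

Of course, this check is purely voluntary, which is the whole point of the thread: a bot that never fetches or honors robots.txt has to be stopped at the server instead.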
In case anyone from one of the issuing organizations drops by... I insist that any robot which wishes to spider my site meet the following requirements:
I have encouraged our friends at Nutch to especially heed item #3; Grub.org failed to do this, and is now unwelcome as a result.
No, I don't have delusions of grandeur, and my sites are not important, but the list above is just "proper netiquette" for robots. I hope my few 403-Forbidden responses will be noticed, and my custom 403 error page might be read, but I doubt it.
Jim
I agree completely with you on this... this is one of the main reasons why I have spent so much time, over the years, accumulating user agents and applying them to my banned listings. I've been doing this via my Apache configs, but I see now that it appears to be more beneficial to use .htaccess for this purpose; at the least, it keeps me from having to restart the server each time I make an update.
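For anyone curious what that .htaccess approach looks like, here's a minimal sketch for Apache 2.x using mod_setenvif. The bot names are just examples pulled from the list earlier in this thread; the exact directives you'd use depend on your Apache version and loaded modules.

```apache
# Sketch: deny requests from known bad user-agents via .htaccess.
# Matching is case-insensitive and by substring.
BrowserMatchNoCase "Wget" bad_bot
BrowserMatchNoCase "psbot" bad_bot
BrowserMatchNoCase "grub-client" bad_bot

# Apache 2.2-style access control: allow everyone except flagged agents.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

Unlike robots.txt, this works against bots that ignore the rules entirely, since the server refuses the request with a 403 instead of relying on the bot's good manners.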
If a site has a robots.txt file and certain pages are disallowed from being indexed, does the spider just take the URL of a disallowed page and index it without any meta information?
thanks in advance.
seonut
Google and AJ/Teoma do that - they list any URL they find anywhere. They are still in compliance with robots.txt, though, in that they do not fetch and index the page - they just list the link. Other SEs may do this too, but I'm not aware of any other major U.S. engines that behave this way.
In order to prevent the link from being listed, you have to allow Google and AJ/Teoma to fetch the page, but include the <meta name="robots" content="noindex"> meta-tag on the page itself. Inefficient, but it works.
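Concretely, that means the page itself has to carry the tag, something like this (the surrounding markup is just illustrative):

```html
<!-- The robot must be ALLOWED to fetch this page in robots.txt,
     otherwise it never sees the noindex instruction below. -->
<head>
  <title>Example page</title>
  <meta name="robots" content="noindex">
</head>
```

It feels backwards, but blocking the fetch in robots.txt only stops the crawl, not the bare URL listing; the noindex tag stops the listing, and the robot can only read it if the fetch is permitted.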
Jim
Thanks for the information. This has answered my question.
Thanks again
seonut