
Forum Moderators: goodroi



bad spider



11:07 pm on Jun 1, 2003 (gmt 0)

10+ Year Member

Well, we have a perfect .htaccess ban list at:


How about a disallow spider list?


2:20 am on Jun 2, 2003 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member


It's a much shorter list, because most bad spiders ignore robots.txt. Like the user-agents listed in the almost-perfect ban list, different webmasters may disagree with some entries, and others may want to add more.

Here's mine:

User-agent: almaden
User-agent: ASPSeek
User-agent: baiduspider
User-agent: dumbBot
User-agent: Generic
User-agent: grub-client
User-agent: MSIECrawler
User-agent: NexaBot
User-agent: NPBot
User-agent: OWR_Crawler
User-agent: psbot
User-agent: rabaz
User-agent: RPT-HTTPClient
User-agent: ScoutAbout
User-agent: semanticdiscovery
User-agent: TurnitinBot
User-agent: Wget
Disallow: /
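For what it's worth, the effect of a multi-agent block like the one above can be checked mechanically. Here's a sketch using Python's standard `urllib.robotparser` (the site URL is made up, and only a couple of the user-agents from the list are shown):

```python
# Sketch: verify that a robots.txt record with several User-agent lines
# followed by a single "Disallow: /" bans all of those agents.
import urllib.robotparser

rules = """\
User-agent: Wget
User-agent: psbot
Disallow: /
"""

parser = urllib.robotparser.RobotFileParser()
parser.parse(rules.splitlines())

# Wget is named in the record, so the whole site is off-limits to it...
print(parser.can_fetch("Wget", "http://example.com/page.html"))       # False
# ...while an agent not named anywhere (and no "User-agent: *" record)
# is allowed by default.
print(parser.can_fetch("Googlebot", "http://example.com/page.html"))  # True
```

Of course, as noted above, this only matters for the robots that actually bother to read robots.txt.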

Some entries are there only because I'm waiting to see if they will obey robots.txt, e.g. dumbBot. Others are either just an annoyance, or use my server bandwidth to make a profit, and are not welcome. Again, my list is short because these are the only ones I have seen that obey (or may obey) robots.txt.

In case anyone from one of the issuing organizations drops by... I insist that any robot which wishes to spider my site meet the following requirements:

  • Strict conformance to robots.txt directives.
  • A proper user-agent string identifying the operating organization, preferably with an info-page URL, though an e-mail address is OK.
  • If a public-domain robot, a proper and enforceable licensing agreement with all third-party users of the robot supporting the above requirements.
  • A demonstrable benefit for my visitors, my potential visitors, my site, or public-domain Web resources.

    I have encouraged our friends at Nutch to especially heed item #3; Grub.org failed to do this, and is now unwelcome as a result.

    No, I don't have delusions of grandeur, and my sites are not important, but the list above is just "proper netiquette" for robots. I hope my few 403-Forbidden responses will be noticed, and my custom 403 error page might be read, but I doubt it.


  • wkitty42

    5:17 am on Jun 3, 2003 (gmt 0)

    10+ Year Member


    i agree completely with you on this... this is one of the main reasons why i have spent so much time, over the years, accumulating user agents and such and applying them to my banned listings... i've been doing this via my apache configs but i see now that it appears to be more beneficial to utilize .htaccess for this purpose, if none other... at the least, it keeps me from having to restart the server each time i make an update...
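    For what it's worth, the .htaccess version of this usually looks something like the sketch below, using mod_setenvif syntax with a couple of agent names borrowed from jdMorgan's list above (your own names and policy would differ):

```apache
# Mark requests from selected user-agents (names from the list above),
# then deny them. Requires mod_setenvif; matching is case-insensitive.
SetEnvIfNoCase User-Agent "Wget"  bad_bot
SetEnvIfNoCase User-Agent "psbot" bad_bot
SetEnvIfNoCase User-Agent "NPBot" bad_bot

# Apache 1.3 / 2.0 style access control
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

    Since .htaccess is re-read on every request, changes take effect immediately, which is exactly the no-restart benefit mentioned above.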


    7:52 pm on Jun 4, 2003 (gmt 0)

    5+ Year Member

    If I may, since you're on the subject of robots.txt, I'd like to ask a question. I'm in need of some answers.

    If a site's robots.txt disallows certain pages from being indexed, does the spider just take the URL and index that, without the meta information?

    thanks in advance.



    8:32 pm on Jun 4, 2003 (gmt 0)

    WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member


    Google and AJ/Teoma do that - they list any URL they find anywhere. They are still in compliance with robots.txt, though, in that they do not fetch and index the page - they just list the link. Other SEs may do this too, but I'm not aware of any other major U.S. engines that behave this way.

    In order to prevent the link from being listed, you have to allow Google and AJ/Teoma to fetch the page, but include the <meta name="robots" content="noindex"> meta-tag on the page itself. Inefficient, but it works.
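    To be concrete, the tag jdMorgan mentions goes in the <head> of the page itself, roughly like this minimal sketch (page title is made up):

```html
<head>
  <title>Page that may be fetched but should not be listed</title>
  <!-- the spider may fetch this page, but should not index or list it -->
  <meta name="robots" content="noindex">
</head>
```

    Note that for the tag to be seen at all, the page must not be blocked in robots.txt, which is the inefficiency mentioned above.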



    9:11 pm on Jun 4, 2003 (gmt 0)

    5+ Year Member


    Thanks for the information. This has answered my question.

    Thanks again


