
Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
robots.txt
bad spider
copongcopong

10+ Year Member



 
Msg#: 8 posted 11:07 pm on Jun 1, 2003 (gmt 0)

Well we have a perfect .htaccess ban list at:

[webmasterworld.com...]

How about a disallow spider list?

 

jdMorgan

WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



 
Msg#: 8 posted 2:20 am on Jun 2, 2003 (gmt 0)

copongcopong,

It's a much shorter list, because most bad spiders ignore robots.txt. Like the user-agents listed in the almost-perfect ban list, different webmasters may disagree with some entries, and others may want to add more.

Here's mine:

User-agent: almaden
User-agent: ASPSeek
User-agent: baiduspider
User-agent: dumbBot
User-agent: Generic
User-agent: grub-client
User-agent: MSIECrawler
User-agent: NexaBot
User-agent: NPBot
User-agent: OWR_Crawler
User-agent: psbot
User-agent: rabaz
User-agent: RPT-HTTPClient
User-agent: ScoutAbout
User-agent: semanticdiscovery
User-agent: TurnitinBot
User-agent: Wget
Disallow: /

Some entries are there only because I'm waiting to see if they will obey robots.txt, e.g. dumbBot. Others are either just an annoyance, or use my server bandwidth to make a profit and are not welcome. Again, my list is short because these are the only ones I have seen that obey (or may obey) robots.txt.
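For anyone who wants to check how a grouped record like the one above is interpreted, Python's standard urllib.robotparser can parse it. A small sketch, using two of the user-agents from Jim's list and an example path (the path itself is arbitrary):

```python
from urllib.robotparser import RobotFileParser

# Two of the user-agents from the list above; the full file is
# parsed the same way: several User-agent lines share one Disallow.
robots_txt = """\
User-agent: Wget
User-agent: psbot
Disallow: /
"""

rp = RobotFileParser()
rp.parse(robots_txt.splitlines())

# Agents named in the record are blocked everywhere...
blocked_ok = rp.can_fetch("Wget", "/index.html")      # False
# ...while agents with no matching record (and no "*" record) may fetch.
allowed_ok = rp.can_fetch("Googlebot", "/index.html")  # True
```

Note that this only tells you what a *compliant* robot would do; as Jim says above, the truly bad spiders never read the file at all.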

In case anyone from one of the issuing organizations drops by... I insist that any robot that wishes to spider my site meet the following requirements:

  • Strict conformance to robots.txt directives.
  • A proper user-agent string identifying the operating organization, preferably with an info-page URL, but an e-mail address is OK.
  • If a public-domain robot, a proper and enforceable licensing agreement with all third-party users of the robot supporting the above requirements.
  • A demonstrable benefit for my visitors, my potential visitors, my site, or public-domain Web resources.

    I have encouraged our friends at Nutch to especially heed item #3; Grub.org failed to do this, and is now unwelcome as a result.

    No, I don't have delusions of grandeur, and my sites are not important, but the list above is just "proper netiquette" for robots. I hope my few 403-Forbidden responses will be noticed, and my custom 403 error page might be read, but I doubt it.

    Jim

  • wkitty42

    10+ Year Member



     
    Msg#: 8 posted 5:17 am on Jun 3, 2003 (gmt 0)

    jdM,

    i agree completely with you on this... this is one of the
    main reasons why i have spent so much time, over the years,
    accumulating user agents and such and applying them to my
    banned listings... i've been doing this via my apache configs,
    but i see now that it appears to be more beneficial to
    utilize .htaccess for this purpose, if for no other reason
    than that it keeps me from having to restart the server each
    time i make an update...
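For reference, a minimal .htaccess fragment in the Apache 1.3/2.0 style that bans by User-Agent, and which (unlike httpd.conf) takes effect without a server restart, might look like this. The two agent names are only illustrative examples taken from the list earlier in the thread, not a recommended ban list:

```
# Flag requests whose User-Agent matches (case-insensitive)
SetEnvIfNoCase User-Agent "Wget" bad_bot
SetEnvIfNoCase User-Agent "psbot" bad_bot

# Serve 403 Forbidden to flagged requests, allow everyone else
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

This assumes AllowOverride permits Limit/FileInfo directives in .htaccess on the server in question.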

    seonut



     
    Msg#: 8 posted 7:52 pm on Jun 4, 2003 (gmt 0)

    If I may, since you're on the subject of robots.txt, I'd like to ask a question. I'm in need of some answers.

    If a site has a robots.txt that disallows certain pages from being indexed, does the spider just take the URL and index it without any meta information?

    thanks in advance.

    seonut

    jdMorgan

    WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member



     
    Msg#: 8 posted 8:32 pm on Jun 4, 2003 (gmt 0)

    seonut,

    Google and AJ/Teoma do that - they list any URL they find anywhere. They are still in compliance with robots.txt, though, in that they do not fetch and index the page - they just list the link. Other SEs may do this too, but I'm not aware of any other major U.S. engines that behave this way.

    In order to prevent the link from being listed, you have to allow Google and AJ/Teoma to fetch the page, but include the <meta name="robots" content="noindex"> meta-tag on the page itself. Inefficient, but it works.
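To make that recipe concrete: leave the page out of the robots.txt Disallow rules so the engines can fetch it, and put the tag in the document head. A minimal sketch (the title and body are placeholders):

```
<html>
<head>
  <title>Example page</title>
  <!-- fetched by the spider, then excluded from the index -->
  <meta name="robots" content="noindex">
</head>
<body>...</body>
</html>
```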

    Jim

    seonut



     
    Msg#: 8 posted 9:11 pm on Jun 4, 2003 (gmt 0)

    Jim,

    Thanks for the information. This has answered my question.

    Thanks again

    seonut

    All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
    WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
    © Webmaster World 1996-2014 all rights reserved