Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
robots.txt syntax
jamesf4218




msg:1527869
 2:40 pm on Nov 18, 2002 (gmt 0)

Hello,

I'd like to get this syntax exactly right. I want to stop all search engine spiders visiting ONE directory on my website.

What's the syntax for this, please? Also, where do I put the robots.txt file?

Cheers

James

 

korkus2000




msg:1527870
 2:45 pm on Nov 18, 2002 (gmt 0)

Here is a great tutorial on robots.txt
[searchengineworld.com...]

This bars all robots from the cgi-bin folder.

User-agent: *
Disallow: /cgi-bin/

Place the file in the site's root directory.

You can also validate it at [searchengineworld.com...]
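
If you'd like to sanity-check rules like these locally before uploading, one option is Python's standard `urllib.robotparser` module. This is only a sketch under some assumptions: `ExampleBot` and the two test paths are made-up names, and the module applies simple prefix matching, so a real spider may interpret edge cases differently.

```python
from urllib.robotparser import RobotFileParser

# The rules from the post above, held in memory instead of fetched from a site.
ROBOTS_TXT = """\
User-agent: *
Disallow: /cgi-bin/
"""


def is_allowed(robots_txt: str, user_agent: str, path: str) -> bool:
    """Return True if user_agent may fetch path under the given rules."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, path)


if __name__ == "__main__":
    # "ExampleBot" and both paths are hypothetical, for illustration only.
    for path in ("/cgi-bin/form.pl", "/index.html"):
        print(path, is_allowed(ROBOTS_TXT, "ExampleBot", path))
```

A well-behaved spider would skip the first path and fetch the second.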

jamesf4218




msg:1527871
 2:50 pm on Nov 18, 2002 (gmt 0)

Thanks. Is this an acceptable alternative to an .htaccess block?

Andrew Thomas




msg:1527872
 3:06 pm on Nov 18, 2002 (gmt 0)

If I want spiders to crawl all my files and directories, is there any benefit to having a robots.txt file? If so, what is the format? All the examples I've seen only show how to disallow.

Thanks

korkus2000




msg:1527873
 3:09 pm on Nov 18, 2002 (gmt 0)

Andrew_Thomas, it will stop the 404 errors logged when spiders request a robots.txt file that doesn't exist. Also, GoogleGuy has said that a custom 404 page with no robots.txt file can confuse the spider and cause your site not to be spidered correctly.

Syntax to allow all:
User-agent: *
Disallow:
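
The difference between an empty `Disallow:` and `Disallow: /` is a single character, so it may be worth checking which one you actually wrote. Here is a small sketch using Python's standard `urllib.robotparser`; the agent name `ExampleBot` and the path are assumptions made up for the demonstration.

```python
from urllib.robotparser import RobotFileParser

ALLOW_ALL = "User-agent: *\nDisallow:\n"    # empty value: nothing is disallowed
BLOCK_ALL = "User-agent: *\nDisallow: /\n"  # "/" is a prefix of every path


def can_crawl(rules: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the (hypothetical) agent may fetch path under rules."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    print("allow-all:", can_crawl(ALLOW_ALL, "/any/page.html"))
    print("block-all:", can_crawl(BLOCK_ALL, "/any/page.html"))
```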

jamesf4218, sorry, I don't know about .htaccess. I am an NT developer.

jdMorgan




msg:1527874
 4:01 pm on Nov 18, 2002 (gmt 0)

jamesf4218,

I want to stop all search engine spiders visiting ONE directory

Is [using robots.txt] an acceptable alternative to an .htaccess [block]?

Yes and no... A valid robots.txt will stop all spiders which request and obey robots.txt from requesting files from your Disallowed directories. However, there are two problems:

  • Some "bad-bots" don't request and/or don't obey robots.txt
  • Some search engines will list a link to a disallowed file if they find such a link

    In the former case, you may need to block these bad-bots using .htaccess, browscap.ini, a firewall, or whatever is available to you.

    In the latter case, even though the SE spider does not request the Disallowed file, it may still find the URL in links on your site, links on other sites, or even in a server log or collection of user bookmarks unintentionally left on-line somewhere. It will therefore list the URL without a title or description, but sometimes with the link text found on the page that links to it.

    I have previously argued that this flies in the face of the intent of the robots exclusion standard, if not its literal wording. However, it depends on whether you define the word "index" to mean "fetch a page and parse it" or "include it in our index". Some SEs won't mention a file that's disallowed with robots.txt, but some will. So I've learned, "that's life, get over it, find a work-around, and move on..."

    About the only thing I know of to stop these search engines from listing the URL of a private page (without cloaking) is to link to your "private" pages only through another "linking page" that meets the following criteria:

  • The linking page must be Allowed in robots.txt
  • The linking page must contain a <meta name="robots" content="noindex,nofollow">
  • All accesses to the "private" files must go through this page.
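
    One way to confirm that a linking page really carries the tag from the second criterion is to parse it and collect the robots directives. The following is a rough sketch using Python's standard `html.parser`; the sample page and the `/private/report.html` link are invented for illustration, not taken from any real site.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from any <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attr = dict(attrs)
        if (attr.get("name") or "").lower() == "robots":
            content = attr.get("content") or ""
            self.directives.update(
                part.strip().lower() for part in content.split(",") if part.strip()
            )


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    parser.close()
    return parser.directives


# Hypothetical linking page; the private URL is a placeholder.
LINKING_PAGE = """<html><head>
<meta name="robots" content="noindex,nofollow">
<title>Members</title></head>
<body><a href="/private/report.html">Report</a></body></html>"""

if __name__ == "__main__":
    found = robots_directives(LINKING_PAGE)
    print("noindex" in found and "nofollow" in found)
```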

    Then you have the problem of 'bots which don't read and respect robots.txt. These need to be blocked. I use a combination of .htaccess and an automatic bad-bot banning script [webmasterworld.com] that was posted here on WebmasterWorld by Key_Master. I have tweaked it to handle multiple simultaneous requests (by adding file-locking to it) and I'm "evaluating" it now to make sure the tweaks didn't break it. On low-to-moderate-traffic sites, it should work just fine as originally posted.
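
    To illustrate why file-locking matters when several requests hit a banning script at once, here is a minimal sketch in Python. It is not Key_Master's script and not the tweak described above; the ban-list file, the `ban_ip` function, and the sample addresses are all assumptions, and `fcntl` is available only on Unix-like systems.

```python
import fcntl
import os
import tempfile


def ban_ip(ban_file: str, ip: str) -> bool:
    """Append ip to ban_file unless already listed. Return True if added."""
    # "a+" creates the file if it is missing and always writes at the end.
    with open(ban_file, "a+") as fh:
        # Wait for an exclusive lock so concurrent requests take turns.
        fcntl.flock(fh, fcntl.LOCK_EX)
        try:
            fh.seek(0)
            banned = {line.strip() for line in fh}
            if ip in banned:
                return False
            fh.write(ip + "\n")
            fh.flush()
            return True
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "banned.txt")  # hypothetical ban list
        print(ban_ip(path, "192.0.2.1"))  # first sighting
        print(ban_ip(path, "192.0.2.1"))  # repeat request, already listed
```

    Without the lock, two simultaneous requests could both read the list, both decide the address is new, and write duplicate or interleaved entries.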

    HTH,
    Jim

SuzyUK




msg:1527875
 9:48 am on Nov 25, 2002 (gmt 0)

Hi, I'm just setting up a robots.txt file. I have followed the tutorial and validated the file so far.

However, a (probably dumb) question:

What about those _vti_* directories and files? I don't use them, but they need to stay on the server for other users who may follow.

Am I right in thinking that they're private anyway, or should I include them in the Disallow?

Suzy

