robot.txt syntax

     

jamesf4218

2:40 pm on Nov 18, 2002 (gmt 0)

10+ Year Member



Hello,

I'd like to get this syntax exactly right. I want to stop all search engine spiders visiting ONE directory on my website.

What's the syntax for this, please? Also, where do I put the robots.txt file?

Cheers

James

korkus2000

2:45 pm on Nov 18, 2002 (gmt 0)

WebmasterWorld Senior Member korkus2000 is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Here is a great tutorial on robots.txt
[searchengineworld.com...]

This bars all robots from the cgi-bin folder.

User-agent: *
Disallow: /cgi-bin/

Place it on the site's root.

You can also validate it at [searchengineworld.com...]
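If you'd rather check the rules locally before uploading, here is a minimal sketch using Python's standard `urllib.robotparser` module. The host `www.example.com`, the agent name `ExampleBot`, and the helper names `robots_url` and `is_allowed` are made up for illustration; only the two rule lines come from the post above.

```python
from urllib.parse import urlsplit, urlunsplit
from urllib.robotparser import RobotFileParser

# The two rule lines from the post above.
RULES = """\
User-agent: *
Disallow: /cgi-bin/
"""


def robots_url(page_url: str) -> str:
    """Return the robots.txt URL at the root of the site serving page_url."""
    parts = urlsplit(page_url)
    return urlunsplit((parts.scheme, parts.netloc, "/robots.txt", "", ""))


def is_allowed(rules: str, agent: str, url: str) -> bool:
    """Report whether a robot that obeys the rules may request url."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, url)


if __name__ == "__main__":
    # Hypothetical URLs, used only to exercise the rules.
    print(robots_url("http://www.example.com/some/dir/page.html"))
    print(is_allowed(RULES, "ExampleBot", "http://www.example.com/cgi-bin/form.pl"))
    print(is_allowed(RULES, "ExampleBot", "http://www.example.com/index.html"))
```

This only tells you how a well-behaved parser reads the file. It says nothing about whether a given spider actually honors it.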

jamesf4218

2:50 pm on Nov 18, 2002 (gmt 0)

10+ Year Member



Thanks. Is this an acceptable alternative to an .htaccess block?

Andrew Thomas

3:06 pm on Nov 18, 2002 (gmt 0)

10+ Year Member



If I want spiders to search all my files and directories, is there any benefit to having a robots.txt file? And if so, what is the format? All the examples I've seen only show how to disallow.

thanx

korkus2000

3:09 pm on Nov 18, 2002 (gmt 0)

WebmasterWorld Senior Member korkus2000 is a WebmasterWorld Top Contributor of All Time 10+ Year Member



Andrew_Thomas, it will stop the 404 errors logged each time a spider requests a robots.txt file that isn't there. Also, GoogleGuy has said that a custom 404 page with no robots.txt file can confuse the spider and cause your site to not be spidered correctly.

syntax to allow all:
User-agent: *
Disallow:
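To see the difference between an empty Disallow and a Disallow of "/", here is a small sketch that feeds both rule sets to Python's standard `urllib.robotparser`. The agent name `ExampleBot`, the helper `can_fetch`, and the test path are assumptions for illustration only.

```python
from urllib.robotparser import RobotFileParser

# An empty Disallow value excludes nothing, so everything may be fetched.
ALLOW_ALL = "User-agent: *\nDisallow:\n"

# A single slash matches every path, so nothing may be fetched.
BLOCK_ALL = "User-agent: *\nDisallow: /\n"


def can_fetch(rules: str, path: str, agent: str = "ExampleBot") -> bool:
    """Return True if the rules let the (hypothetical) agent request path."""
    parser = RobotFileParser()
    parser.parse(rules.splitlines())
    return parser.can_fetch(agent, path)


if __name__ == "__main__":
    for name, rules in (("allow all", ALLOW_ALL), ("block all", BLOCK_ALL)):
        print(name, can_fetch(rules, "/any/page.html"))
```

The two files differ by one character, so it is worth checking which one you actually uploaded.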

jamesf4218, sorry, I don't know about .htaccess. I am an NT developer.

jdMorgan

4:01 pm on Nov 18, 2002 (gmt 0)

WebmasterWorld Senior Member jdmorgan is a WebmasterWorld Top Contributor of All Time 10+ Year Member



jamesf4218,

I want to stop all search engine spiders visiting ONE directory

Is [using robots.txt] an acceptable alternative to an .htaccess [block]?

Yes and no... A valid robots.txt will stop all spiders which request and obey robots.txt from requesting files from your Disallowed directories. However, there are two problems:

  • Some "bad-bots" don't request and/or don't obey robots.txt
  • Some search engines will list a link to a disallowed file if they find such a link

    In the former case, you may need to block these bad-bots using .htaccess, browscap.ini, a firewall, or whatever is available to you.

    In the latter case, even though the SE spider does not request the Disallowed file, it may still find the URL in links on your site, links on other sites, or even in a server log or collection of user bookmarks unintentionally left on-line somewhere. It will therefore list the URL without a title or description, but sometimes with the link text found on the page that links to it.

    I have previously argued that this flies in the face of the intent of the robots exclusion standard, if not its literal wording. However, it depends on whether you define the word "index" to mean "fetch a page and parse it" or "include it in our index". Some SEs won't mention a file that's disallowed with robots.txt, but some will, so I've learned, "that's life, get over it, find a work-around, and move on..."

    About the only thing I know of to stop these search engines from listing the URL of a private page (without cloaking) is to link to your "private" pages only through another "linking page" that meets the following criteria:

  • The linking page must be Allowed in robots.txt
  • The linking page must contain a <meta name="robots" content="noindex,nofollow"> tag
  • All accesses to the "private" files must go through this page.
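As a rough way to confirm that a linking page really carries the robots meta tag from the second criterion, here is a sketch built on Python's standard `html.parser`. The class name `RobotsMetaParser`, the helper `robots_directives`, and the sample markup are hypothetical; it checks only the meta tag and does not verify the other two criteria.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives found in <meta name="robots"> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        # HTMLParser lowercases attribute names; values may be None.
        attr = {name: (value or "") for name, value in attrs}
        if attr.get("name", "").lower() != "robots":
            return
        for item in attr.get("content", "").split(","):
            item = item.strip().lower()
            if item:
                self.directives.add(item)


def robots_directives(html: str) -> set:
    """Return the set of robots meta directives present in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


if __name__ == "__main__":
    # Made-up linking page used only for illustration.
    page = (
        '<html><head><meta name="robots" content="noindex,nofollow"></head>'
        '<body><a href="/private/report.html">report</a></body></html>'
    )
    found = robots_directives(page)
    print({"noindex", "nofollow"} <= found)
```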

    Then you have the problem of 'bots which don't read and respect robots.txt. These need to be blocked. I use a combination of .htaccess and an automatic bad-bot banning script [webmasterworld.com] that was posted here on WebmasterWorld by Key_Master. I have tweaked it to handle multiple simultaneous requests (by adding file-locking to it) and I'm "evaluating" it now to make sure the tweaks didn't break it. On low-to-moderate-traffic sites, it should work just fine as originally posted.
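To illustrate why file-locking matters when several requests hit a ban list at the same moment, here is a minimal sketch in Python. It is not Key_Master's script and not the tweak described above; the function name `ban_ip` and the one-address-per-line file format are assumptions, and `fcntl` is available on Unix-like systems only.

```python
import fcntl


def ban_ip(ban_file: str, ip: str) -> bool:
    """Add ip to ban_file unless it is already listed.

    Returns True if the address was appended, False if it was present.
    The exclusive lock keeps simultaneous requests from interleaving
    their reads and writes on the same file.
    """
    # "a+" creates the file if needed and sends every write to the end.
    with open(ban_file, "a+") as fh:
        fcntl.flock(fh, fcntl.LOCK_EX)  # waits until other holders release
        try:
            fh.seek(0)
            banned = {line.strip() for line in fh}
            if ip in banned:
                return False
            fh.write(ip + "\n")
            fh.flush()
            return True
        finally:
            fcntl.flock(fh, fcntl.LOCK_UN)


if __name__ == "__main__":
    # Hypothetical file name and documentation-range address.
    print(ban_ip("banned_ips.txt", "192.0.2.10"))
```

The list still has to be consulted by whatever does the actual blocking (.htaccess rules, a firewall, and so on); this sketch covers only the safe update step.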

    HTH,
    Jim

    SuzyUK

    9:48 am on Nov 25, 2002 (gmt 0)

    WebmasterWorld Senior Member suzyuk is a WebmasterWorld Top Contributor of All Time 10+ Year Member



    Hi. I'm just setting up a robots.txt file and have followed the tutorial and validated it so far.

    However, a (probably dumb) question:

    What about those _vti_* directories and files? I don't use them, but they need to stay on the server for other users that may follow.

    Am I right in thinking that they're private anyway, or should I include them in the Disallow?

    Suzy

     
