
Sitemaps, Meta Data, and robots.txt Forum

    
Question about simple robots.txt file
nadsab




msg:1528793
 4:55 am on Apr 9, 2003 (gmt 0)

Hi,

For my robots.txt file, to exclude a page in my root, should it read:

User-agent: *
Disallow: page1.html

or...

User-agent: *
Disallow: /page1.html

Or does it matter?

Thanx

 

eaden




msg:1528794
 5:18 am on Apr 9, 2003 (gmt 0)

AFAIK there isn't an "only allow" option in robots.txt,

so you have to disallow each directory or file you don't want indexed. For example:

User-agent: *
Disallow: /cgi-bin/
Disallow: /images/
Disallow: /noindex.html

This will disallow /cgi-bin/, /images/ and /noindex.html

nadsab




msg:1528795
 5:24 am on Apr 9, 2003 (gmt 0)

User-agent: *
Disallow: page1.html

or...

User-agent: *
Disallow: /page1.html

Or does it matter? Do I need a / before files if I'm excluding files in root web dir?

Also, once this is placed on the server, will previously indexed pages be removed from the Google index once they are listed as Disallow in robots.txt?

Thanx

DaveAtIFG




msg:1528796
 2:18 pm on Apr 9, 2003 (gmt 0)

From Brett's tutorial [searchengineworld.com],
There is a wildcard nature to the Disallow directive. The standard dictates that /bob would disallow /bob.html and /bob/index.html (both the file bob and files in the bob directory will not be indexed).

Also, Robots.txt File Exclusion Standard and Format [searchengineworld.com].
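
The prefix matching described in that quote can be sketched in a few lines of Python. This is only an illustration of the rule as quoted, not any search engine's actual code; the function name is_disallowed and the sample paths are made up for the example.

```python
def is_disallowed(path: str, disallow_rules: list[str]) -> bool:
    """Return True if any non-empty Disallow value is a prefix of path.

    Hypothetical helper: models only the simple prefix rule from the
    original exclusion standard (no Allow lines, no pattern characters).
    """
    return any(rule != "" and path.startswith(rule) for rule in disallow_rules)


rules = ["/bob"]
for path in ["/bob.html", "/bob/index.html", "/alice.html"]:
    # The first two paths begin with "/bob", the third does not.
    print(path, is_disallowed(path, rules))
```

Note that a rule of /bob/ (with the trailing slash) would match only paths inside the directory, so /bob.html would stay crawlable.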

nadsab




msg:1528797
 2:45 pm on Apr 9, 2003 (gmt 0)

Thanks DaveAtIFG,

So I don't have to use a / for individual files.

jdMorgan




msg:1528798
 3:09 pm on Apr 9, 2003 (gmt 0)

nadsab,

Yes, you do.

To disallow index.html in your web root directory:

Disallow: /index.html
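
One way to see why the leading slash matters is to run both versions of the rule through a robots.txt parser. The sketch below uses Python's standard urllib.robotparser as a stand-in; real crawlers may parse differently, and www.example.com, ExampleBot and the helper name is_blocked are placeholders invented for the example.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_txt: str, url: str, agent: str = "ExampleBot") -> bool:
    """Return True if robots_txt forbids agent from fetching url.

    Hypothetical helper around the standard-library parser.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch(agent, url)


WITH_SLASH = "User-agent: *\nDisallow: /page1.html\n"
WITHOUT_SLASH = "User-agent: *\nDisallow: page1.html\n"
URL = "http://www.example.com/page1.html"

# Requested paths always start with "/", so a rule written without the
# leading slash is not a prefix of the path being checked.
print("with slash, blocked:", is_blocked(WITH_SLASH, URL))
print("without slash, blocked:", is_blocked(WITHOUT_SLASH, URL))
```

With this parser, only the version with the leading slash should report the page as blocked.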

Jim

nadsab




msg:1528799
 3:37 pm on Apr 9, 2003 (gmt 0)

Brett's tutorial, which DaveAtIFG linked in his post above, says this:

This one keeps googlebot from getting at the cheese.htm file:

User-agent: googlebot
Disallow: cheese.htm

Is the above from Brett's tutorial incorrect?

DaveAtIFG




msg:1528800
 9:53 pm on Apr 9, 2003 (gmt 0)

To block all robots from one file in the root, I use:
User-agent: *
Disallow: blockedfile.html

To block all robots from one file in a subdirectory:
User-agent: *
Disallow: /subdir/blockedfile.html

To block all robots from the files in an entire subdirectory:
User-agent: *
Disallow: /blockedsubdir/

Other techniques may work but I'm confident these work as advertised.

nadsab




msg:1528801
 10:12 pm on Apr 9, 2003 (gmt 0)

Thanks Dave!

jdMorgan




msg:1528802
 10:46 pm on Apr 9, 2003 (gmt 0)

User-agent: googlebot
Disallow: cheese.htm

>> Is the above from Brett's tutorial incorrect?

Yes - sorry - it's wrong. A Standard for Robot Exclusion [robotstxt.org]

Jim

nadsab




msg:1528803
 5:17 am on Apr 10, 2003 (gmt 0)

That brings me to another question. If I list those files as Disallow in robots.txt, will Google remove the pages that are already in its index? Or is there another tag for deleting files from the index?

jdMorgan




msg:1528804
 5:51 am on Apr 10, 2003 (gmt 0)

nadsab,

A sticky question, that one...

If Google finds a robots.txt Disallow for a page, it will remove the page's title and description from its search results. It will also no longer match search terms to the words on that page. So, the page essentially disappears from the Google search results pages. However, if Google finds a link to that page, it will still show that page in results when someone clicks on "More results from <this domain>".

I went around and around with this, trying to find a way to tell them "don't mention my contact forms pages at all, please", and here's what I ended up with:
For Google, don't Disallow the page in robots.txt, but place a <meta name="robots" content="noindex"> tag in the head section of the page itself.
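
To double-check that a page really carries the tag, something like the following Python sketch could be used. It relies on the standard html.parser module; the class name RobotsMetaParser, the helper robots_directives and the sample page are made up for illustration.

```python
from html.parser import HTMLParser


class RobotsMetaParser(HTMLParser):
    """Collect the directives from <meta name="robots" content="..."> tags."""

    def __init__(self):
        super().__init__()
        self.directives = set()

    def handle_starttag(self, tag, attrs):
        if tag != "meta":
            return
        attributes = dict(attrs)
        # Attribute values can be None when the attribute has no value.
        if (attributes.get("name") or "").lower() != "robots":
            return
        content = attributes.get("content") or ""
        for part in content.split(","):
            if part.strip():
                self.directives.add(part.strip().lower())


def robots_directives(html: str) -> set[str]:
    """Return the set of robots meta directives found in html."""
    parser = RobotsMetaParser()
    parser.feed(html)
    return parser.directives


PAGE = """<html><head>
<title>Contact form</title>
<meta name="robots" content="noindex">
</head><body>...</body></html>"""

print("noindex" in robots_directives(PAGE))
```

Keep in mind that a crawler can only see this tag if it is allowed to download the page, which is why the page must not also be Disallowed in robots.txt.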

You'll also need to do this for Ask Jeeves/Teoma as well; their handling of robots.txt is the same as Google's.
All the others seem to interpret a robots.txt Disallow as "don't mention this page at all." (I'm speaking of major U.S. search engines here - there may be other national and regional search engines which act like Google and AJ/T, but I am not aware of them.)

After reading the above, you may ask, "Well then, what good is robots.txt, if these search engines treat Disallows this way? Why not just use the robots metatag and forget robots.txt?"

The answer is that using robots.txt saves bandwidth. If a page is Disallowed in robots.txt, Google and AJ/T will list the page URL (with no title or description) if they find a link to it, but they will not download the page. On the other hand, in order to see the on-page robots metatag, a search engine *must* download the page. So using a robots.txt Disallow for those engines which treat it as "don't mention it" can save you a lot of bandwidth if the pages are large or spidered often because the site has high PR or link popularity. As a result, I have many pages which are disallowed for all engines except Google and AJ/T, and also are tagged with a <meta name="robots" content="noindex,nofollow"> specifically for Google and AJ/T.
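
A rough sketch of that kind of split setup, checked with Python's standard urllib.robotparser: the user-agent tokens Googlebot and Teoma, the file /contact.html, the host www.example.com and the helper can_crawl are assumptions for the example, not a copy of any real robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumed layout: the two engines that need to read the on-page meta tag
# get an empty Disallow (nothing blocked); every other robot is kept out.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow:

User-agent: Teoma
Disallow:

User-agent: *
Disallow: /contact.html
"""


def can_crawl(agent: str, url: str) -> bool:
    """Return True if ROBOTS_TXT lets agent fetch url (hypothetical helper)."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(agent, url)


CONTACT = "http://www.example.com/contact.html"
for agent in ["Googlebot", "Teoma", "ExampleBot"]:
    print(agent, can_crawl(agent, CONTACT))
```

The page itself would still need the noindex meta tag in its head section, so the engines that are allowed to fetch it know not to list it.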

I've tried to use my language carefully and specifically above - hopefully, this isn't too confusing...

Jim

nadsab




msg:1528805
 12:56 pm on Apr 10, 2003 (gmt 0)

Thanks JD,

Bandwidth is not a big issue so I will prob. use <meta name="robots" content="noindex"> instead. Oh well, wish I'd known this before, have to start all over.

Does <meta name="robots" content="noindex"> also work for most or all other engines besides google?

