Forum Moderators: goodroi
When it comes to a robots.txt file, I'm a bit of a newbie. Okay, a complete newbie. I've always used no-index tags for this purpose. Can I just run the syntax by you, to play safe?
Say I own googleguyiscool.com (hey, I'll try anything) but I don't want the following pages spidered 'cos they link to pages or sites which Google may disapprove of:
googleguyiscool.com/links.htm
googleguyiscool.com/wincash/index.htm
What exactly would the robots.txt file look like? Should I then just FTP to my server's main directory as normal? Any need to change permissions for the file?
Many thanks.
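For what it's worth, a minimal robots.txt for that case might look like this (this assumes both files sit at the paths shown, and blocks all well-behaved robots, not just Googlebot):

```
User-Agent: *
Disallow: /links.htm
Disallow: /wincash/index.htm
```

Upload it as plain text to the document root so it's reachable at googleguyiscool.com/robots.txt; ordinary world-readable permissions (e.g. 644) are all it needs.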
You can use Brett's Robots.txt Validator [searchengineworld.com] to check it.
Just upload it then check it with your browser. If you can read it then it should be fine.
I have a line somewhat similar to the following in my robots.txt file:
User-Agent: *
Disallow: /resellers/
Next thing you know, Googlebot is messing around indexing a file like /resellers/westcoast/reseller1.html
I'm a little worried since the above page has duplicate content to another page on our site only minus our logo.
We've already had PR0 for four months, I hope this duplicate content doesn't really put us in the slammer permanent-like.
Did I screw up my robots.txt file? Why does Google still index that stuff?
I might be wrong but I think you might have made a mistake in the robots.txt
A quote from this page: [robotstxt.org...]
For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html
To me that means if you want to Disallow everything in the resellers folder, you should leave off the trailing / like this:
User-Agent: *
Disallow: /resellers
Like I said, could be wrong but that is the way I do it and it works for me.
Disallow: /help/
will disallow all files in the /help/ directory but allow help.htm.
Disallow: /help
will disallow the help.html and /help/index.html but allow the contents of the /help/ directory to be indexed.
My understanding is if you want the entire directory to be off limits, then it would look like this...
Disallow: /help/
Which then tells the spider not to index anything that contains domain.com/help/.
Please, for those of you who are experts on the robots.txt file, can you clarify this for us?
I believe the confusion here is shared in the search engine world, too. On the Microsoft website they tell people to leave off the trailing /, but all other robots.txt info I've seen elsewhere includes the trailing /. The only way I know of to be certain is to list all directories BOTH ways in your robots.txt file:
User-Agent: *
Disallow: /help
Disallow: /help/
/help should match /help, /help/, /help/index.html, /help/whatever, /helpful, /helpful.html, and /helpful/whatever.
Of those /help/ should match only /help/, /help/index.html, /help/whatever.
I have wondered what happens when Google sees a link to /help and /help/ is robots.txt restricted. It ought to refuse to follow the 302 redirect from /help to /help/ (assuming that it's just a normal directory style URL).
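That match list is easy to check with Python's standard-library robots.txt parser, which implements the same plain prefix matching (the bot name and paths below are invented for the test):

```python
from urllib import robotparser

# Parse a hypothetical rules file from a list of lines (no network needed)
rp = robotparser.RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Disallow: /help",
])

# Prefix matching: every path that *starts with* "/help" is blocked
for path in ["/help", "/help/", "/help/index.html", "/helpful", "/helpful/whatever"]:
    print(path, rp.can_fetch("TestBot", path))   # False for all of these

print("/hel", rp.can_fetch("TestBot", "/hel"))   # True: "/hel" is not prefixed by "/help"
```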
Even this bit?
Disallow: /help
will disallow the help.html and /help/index.html but allow the contents of the /help/ directory to be indexed.
The last part of that doesn't fit in with my understanding, "...any URL that starts with this value will not be retrieved...".
I don't want to barrage Mr Koster (who's done the Web a real favour, IMO) so Son_House, could you follow that up with him seeing as you've been in touch already?
Disallow: /help
will disallow the help.html and /help/index.html
but allow the contents of the /help/ directory to be indexed.
Because there is no trailing slash after /help, the robot needs to resolve the rest of the address. In this instance, it will first resolve to the .html (.asp, .cgi, etc.) file and then it will look for the /help directory and resolve to the index.html (.asp, .cgi, etc.) of that directory but nothing else.
Without the trailing forward slash, there are only two possibilities for the spider: a help.html and a help/index.html.
I'm assuming that the addition of the trailing forward slash now tells the spider to stay away from all URLs that contain /help/ and it doesn't have to guess the syntax. Without that trailing forward slash, it only has two options.
I see where you are coming from. What is the difference between /help and /help/? If you were pointing a URL that ended with /help or /help/ both would resolve to the index.html.
[edited] I updated this to reflect ciml's comments below.
(edited by: pageoneresults at 5:09 pm (utc) on April 10, 2002)
> If you were pointing a URL that ended with /help or /help/ both would resolve to the index.html.
Robots can't see the underlying directory structure. /help usually gives a 302 redirect to /help/, /help/ sometimes is internally rewritten to /help/index.html, but it can just as happily be /help/index.cgi, /help/default.asp or /help/whatever.whatever
In my view, no robot has the right to guess what aliases a URL has without using HTTP header or content inspection, and robots.txt matching is just based on the first so many characters in a URL string.
You may also specify directories:
Disallow: /cgi-bin/
Which would block spiders from your cgi-bin directory.
There is a wildcard nature to the Disallow directive. The standard dictates that /bob would disallow /bob.html and /bob/index.html (both the file bob and files in the bob directory will not be indexed).
The part that reads (both the file bob and files...) tells me that /bob and /bob/ are one and the same due to the wildcard nature of the Disallow directive. ?
"Disallow: /bob/" covers everything covered by "Disallow: /bob" except for the URL /bob with nothing after it (ignoring :portnumber).
Static sites with standard file naming conventions, on the default configuration of any mainstream server, would only ever return 302 (moved) or 404 (not found) for that, but there's no reason not to have URLs like /bob.
(And people call me a pedant?)
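That one-URL difference is easy to demonstrate with Python's stdlib parser (the bot name is made up; the paths mirror ciml's example):

```python
from urllib import robotparser

# With the trailing slash, only the bare URL /bob escapes the rule
rp = robotparser.RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Disallow: /bob/",
])

print(rp.can_fetch("TestBot", "/bob"))         # True: "/bob" does not start with "/bob/"
print(rp.can_fetch("TestBot", "/bob/"))        # False
print(rp.can_fetch("TestBot", "/bob/x.html"))  # False
```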
Disallow: /help
Will disallow help.html and /help/index.html
For the question ciml asked:
but allow the contents of the /help/ directory to be indexed.
The answer is no, all files in that directory and any subdirectories are disallowed. The robot simply does a prefix match. So e.g. /help, /help/index.html, /helper, /helper/foo.html, /help/bar.html all start with "/help", which is disallowed.
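That "simply does a prefix match" really is the whole algorithm. A sketch of the matching a conforming robot performs (names here are mine, not from any spec):

```python
def is_disallowed(path, disallow_values):
    """robots.txt matching per the original standard: a URL path is off
    limits if it starts with any Disallow value. An empty Disallow value
    means "allow everything", hence the truthiness guard on the prefix."""
    return any(prefix and path.startswith(prefix) for prefix in disallow_values)

# The exact paths from the post above, against Disallow: /help
for path in ["/help", "/help/index.html", "/helper", "/helper/foo.html", "/help/bar.html"]:
    print(path, is_disallowed(path, ["/help"]))   # True for all of them
```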
I used the word wildcard in explaining our thoughts to him and he feels that suggests to people that they can do this: Disallow: /help* which is incorrect.
So if I understand this, if I wanted to block off a directory, I could use Disallow: /help or Disallow: /help/
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved.
btw: if anyone has any last critical comments about the robots.txt validator, I would appreciate an email/stickymail about it. I'm about to make it the final version, and I believe it will be released as open source software (with a link back requirement).
04.18.2002 - robots.txt Validator Update [webmasterworld.com]