Forum Moderators: goodroi
When it comes to a robots.txt file, I'm a bit of a newbie. Okay, a complete newbie. I've always used no-index tags for this purpose. Can I just run the syntax by you, to play safe?
Say I own googleguyiscool.com (hey, I'll try anything) but I don't want the following pages spidered 'cos they link to pages or sites which Google may disapprove of:
googleguyiscool.com/links.htm
googleguyiscool.com/wincash/index.htm
What exactly would the robots.txt file look like? Should I then just FTP to my server's main directory as normal? Any need to change permissions for the file?
Many thanks.
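For what it's worth, a minimal robots.txt for that case might look like this (this assumes both files sit at the paths shown, and blocks all well-behaved robots, not just Googlebot):

```
User-Agent: *
Disallow: /links.htm
Disallow: /wincash/index.htm
```

Upload it as plain text to the document root so it's reachable at googleguyiscool.com/robots.txt; ordinary world-readable permissions (e.g. 644) are all it needs.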
You can use Brett's Robots.txt Validator [searchengineworld.com] to check it.
Just upload it then check it with your browser. If you can read it then it should be fine.
I have a line somewhat similar to the following in my robots.txt file:
User-Agent: *
Disallow: /resellers/
Next thing you know, Googlebot is messing around indexing a file like /resellers/westcoast/reseller1.html
I'm a little worried since the above page has duplicate content to another page on our site only minus our logo.
We've already had PR0 for four months, I hope this duplicate content doesn't really put us in the slammer permanent-like.
Did I screw up my robots.txt file? Why does Google still index that stuff?
I might be wrong but I think you might have made a mistake in the robots.txt
A quote from this page: [robotstxt.org...]
For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html
To me that means if you want to Disallow everything in the resellers folder, you should leave off the trailing / like this:
User-Agent: *
Disallow: /resellers
Like I said, could be wrong but that is the way I do it and it works for me.
Disallow: /help/
will disallow all files in the /help/ directory but allow help.htm.
Disallow: /help
will disallow the help.html and /help/index.html but allow the contents of the /help/ directory to be indexed.
My understanding is if you want the entire directory to be off limits, then it would look like this...
Disallow: /help/
Which then tells the spider not to index anything that contains domain.com/help/.
Please, for those of you who are experts on the robots.txt file, can you clarify this for us?
I believe the confusion here is shared in the search engine world, too. On the Microsoft website they tell people to leave off the trailing /, but all other robots.txt info I've seen elsewhere includes the trailing /. The only way I know of to be certain is to list all directories BOTH ways in your robots.txt file:
User-Agent: *
Disallow: /help
Disallow: /help/
/help should match /help, /help/, /help/index.html, /help/whatever, /helpful, /helpful.html, and /helpful/whatever.
Of those /help/ should match only /help/, /help/index.html, /help/whatever.
I have wondered what happens when Google sees a link to /help and /help/ is robots.txt restricted. It ought to refuse to follow the 302 redirect from /help to /help/ (assuming that it's just a normal directory style URL).
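That match list is easy to check with Python's standard-library robots.txt parser, which implements the same plain prefix matching (the bot name and paths below are invented for the test):

```python
from urllib import robotparser

# Parse a hypothetical rules file from a list of lines (no network needed)
rp = robotparser.RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Disallow: /help",
])

# Prefix matching: every path that *starts with* "/help" is blocked
for path in ["/help", "/help/", "/help/index.html", "/helpful", "/helpful/whatever"]:
    print(path, rp.can_fetch("TestBot", path))   # False for all of these

print("/hel", rp.can_fetch("TestBot", "/hel"))   # True: "/hel" is not prefixed by "/help"
```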
Even this bit?
Disallow: /help
will disallow the help.html and /help/index.html but allow the contents of the /help/ directory to be indexed.
The last part of that doesn't fit in with my understanding, "...any URL that starts with this value will not be retrieved...".
I don't want to barrage Mr Koster (who's done the Web a real favour, IMO) so Son_House, could you follow that up with him seeing as you've been in touch already?
Disallow: /help
will disallow the help.html and /help/index.html
but allow the contents of the /help/ directory to be indexed.
Because there is no trailing slash after /help, the robot needs to resolve the rest of the address. In this instance, it will first resolve to the .html (.asp, .cgi, etc.) file and then it will look for the /help directory and resolve to the index.html (.asp, .cgi, etc.) of that directory but nothing else.
Without the trailing forward slash, there are only two possibilities for the spider: a help.html and a help/index.html.
I'm assuming that the addition of the trailing forward slash now tells the spider to stay away from all URLs that contain /help/ and it doesn't have to guess the syntax. Without that trailing forward slash, it only has two options.
I see where you are coming from. What is the difference between /help and /help/? If you were pointing a URL that ended with /help or /help/ both would resolve to the index.html.
[edited] I updated this to reflect ciml's comments below.
(edited by: pageoneresults at 5:09 pm (utc) on April 10, 2002)
> If you were pointing a URL that ended with /help or /help/ both would resolve to the index.html.
Robots can't see the underlying directory structure. /help usually gives a 302 redirect to /help/, /help/ sometimes is internally rewritten to /help/index.html, but it can just as happily be /help/index.cgi, /help/default.asp or /help/whatever.whatever
In my view, no robot has the right to guess what aliases a URL has without using HTTP header or content inspection, and robots.txt matching is just based on the first so many characters in a URL string.
You may also specify directories:
Disallow: /cgi-bin/
Which would block spiders from your cgi-bin directory.
There is a wildcard nature to the Disallow directive. The standard dictates that /bob would disallow /bob.html and /bob/index.html (both the file bob and files in the bob directory will not be indexed).
The part that reads (both the file bob and files...) tells me that /bob and /bob/ are one and the same due to the wildcard nature of the Disallow directive. ?
"Disallow: /bob/" covers everything covered by "Disallow: /bob" except for the URL /bob with nothing after it (ignoring :portnumber).
Static sites with standard file naming conventions, on the default configuration of any mainstream server, would only ever return 302 (moved) or 404 (not found) for that, but there's no reason not to have URLs like /bob.
(And people call me a pedant?)
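That one-URL difference is easy to demonstrate with Python's stdlib parser (the bot name is made up; the paths mirror ciml's example):

```python
from urllib import robotparser

# With the trailing slash, only the bare URL /bob escapes the rule
rp = robotparser.RobotFileParser()
rp.parse([
    "User-Agent: *",
    "Disallow: /bob/",
])

print(rp.can_fetch("TestBot", "/bob"))         # True: "/bob" does not start with "/bob/"
print(rp.can_fetch("TestBot", "/bob/"))        # False
print(rp.can_fetch("TestBot", "/bob/x.html"))  # False
```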
Disallow: /help
Will disallow help.html and /help/index.html
For the question ciml asked:
but allow the contents of the /help/ directory to be indexed.
The answer is no, all files in that directory and any subdirectories are disallowed. The robot simply does a prefix match. So e.g. /help, /help/index.html, /helper, /helper/foo.html, /help/bar.html all start with "/help", which is disallowed.
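That "simply does a prefix match" really is the whole algorithm. A sketch of the matching a conforming robot performs (names here are mine, not from any spec):

```python
def is_disallowed(path, disallow_values):
    """robots.txt matching per the original standard: a URL path is off
    limits if it starts with any Disallow value. An empty Disallow value
    means "allow everything", hence the truthiness guard on the prefix."""
    return any(prefix and path.startswith(prefix) for prefix in disallow_values)

# The exact paths from the post above, against Disallow: /help
for path in ["/help", "/help/index.html", "/helper", "/helper/foo.html", "/help/bar.html"]:
    print(path, is_disallowed(path, ["/help"]))   # True for all of them
```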
I used the word wildcard in explaining our thoughts to him and he feels that suggests to people that they can do this: Disallow: /help* which is incorrect.
So if I understand this, if I wanted to block off a directory, I could use Disallow: /help or Disallow: /help/
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved.
btw: if anyone has any last critical comments about the robots.txt validator, I would appreciate an email/stickymail about it. I'm about to make it the final version, and I believe it will be released as open source software (with a link back requirement).
04.18.2002 - robots.txt Validator Update [webmasterworld.com]