Forum Moderators: open
NO spider follows your robots.txt file 100%.
You can contact Google about it though.
More information available at:
[google.com...]
So if it was sensitive content, you may be lucky and it may never show up in the SERPs.
If you contact them, be brief and clear in your message.
First, validate your robots.txt [searchengineworld.com]. Then send a description of your problem along with a sample of your access log file to googlebot@google.com.
HTH,
Jim
I don't think I've heard of any actual robots.txt bugs at Google in several months, but it's always possible. Glad you got a useful answer back quickly, uber_boy.
Rae
I'll wager you have the Google toolbar installed and enabled, and that you visited your new site. The other (remote) possibility is that your server log files are open and have been indexed. :o
Google will list any URL it becomes aware of in any way. However, it will not spider any page you have disallowed in robots.txt. I've argued before that there's a question of semantics between "indexing" and "showing a link", but the robots exclusion standard doesn't say they can't show the link. So there it is, with no title or description - just a bare link.
If you really want to tell Google "Don't mention this page at all" you have to allow it to be spidered and place a meta robots noindex tag on the page itself.
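For reference, the tag goes in the page's head section and looks something like this (a generic example, not taken from any particular site in this thread):

```html
<!-- Tells compliant robots: do not include this page in the index -->
<meta name="robots" content="noindex">
```

Remember the catch mentioned above: the page must *not* be disallowed in robots.txt, or the spider will never fetch it and never see the tag.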
BTW, Ask Jeeves/Teoma shows the same behaviour, but they're the only other one that I'm aware of.
Now that I know the fix, I can live with it.
Jim
Thanks for sending us the requested information.
To prevent the crawling of the disallowed directories, please make the following changes to your robots.txt file:
User-agent: *
Disallow: /foo/
Disallow: /bar/
Change To:
User-agent: *
Disallow: /foo
Disallow: /bar
Regards,
The Google Team
And as I've noted a couple of times, googlebot obeyed the original robots.txt file for the first week of the deep crawl.
uber_boy,
The following is a quote from A Standard for Robot Exclusion [robotstxt.org]. Note the second sentence of the quote:
Disallow
The value of this field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example, Disallow: /help disallows both /help.html and /help/index.html, whereas Disallow: /help/ would disallow /help/index.html but allow /help.html.
I hope that answer came from a "Level 1" tech out there, because it looks wrong. Please keep your log files of the "incident", 'cause I suspect this will need some looking into...
Robots are supposed to use simple prefix-matching to determine which resources are off-limits. If you say, "Disallow: /myfiles/", then www.example.com/myfiles/whatever.html is off-limits, and www.example.com/myfiles.html is not.
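That prefix rule is simple enough to sketch in a few lines of Python (the paths and rules below are illustrative, chosen to mirror the examples in this thread, not taken from any real robots.txt):

```python
# Prefix matching as described in the robots exclusion standard:
# a URL path is disallowed if it starts with any Disallow value.
def is_disallowed(path: str, disallow_rules: list[str]) -> bool:
    return any(path.startswith(rule) for rule in disallow_rules if rule)

# "Disallow: /myfiles/" blocks the directory but not /myfiles.html
rules = ["/myfiles/"]
print(is_disallowed("/myfiles/whatever.html", rules))  # True
print(is_disallowed("/myfiles.html", rules))           # False

# "Disallow: /help" (no trailing slash) blocks both forms,
# matching the robotstxt.org quote above
rules = ["/help"]
print(is_disallowed("/help.html", rules))        # True
print(is_disallowed("/help/index.html", rules))  # True
```

By this reading, the trailing-slash version is the *narrower* rule, so dropping the slash (as Google's reply suggested) only broadens the match; it shouldn't have been needed to block the directories themselves.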
Best,
Jim
Is your Disallow statement the last in your file, and does it end with a newline? Some bots seem to insist on the trailing newline (I don't know if the Google bots do), and not all robots.txt verification programs check for it.
Regards,
R.