Forum Moderators: goodroi
Impossible?
# Disallow nothing for G and Y
User-Agent: Googlebot
User-Agent: Slurp
Disallow:

# Disallow everything for all others (who obey)
User-agent: *
Disallow: /

In the second record, we disallow all local URL-paths starting with "/". Since all local URL-paths start with a slash, any robots other than G or Y will be instructed not to fetch anything from your server.
Note the blank line at the end. A blank line is required after each record in robots.txt. Some old and primitive robots will break if you leave it out after the last record.
Note also the multiple User-agent lines in the first record. While all of the major search engines recognize this construct, and it is part of the original Standard, again, some older and more primitive robots won't recognize it. So if you add an old or second-tier robot to this first record, keep an eye on it to be sure it doesn't get confused. If it does, simply give it its own private record, added above the last record shown above.
And as hinted at by the note on the final record, robots.txt is a request for good robots not to fetch some or all of your pages. It has no 'enforcement' capability, and many very-dumb or malicious robots won't obey it. Many of them won't even fetch it. And further complicating things, robots from many on-line directories won't fetch it either. They figure that since you submitted your page to their directory, they have a right/duty to check that the URL is still valid, which is reasonable. You can easily detect these directory robots because they will only ask for the pages you submitted, and will not 'spider' your whole site like Googlebot or Slurp do.
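To make the record-selection logic concrete, here is a minimal sketch (in Python, not from the thread) of how a well-behaved robot might pick its record: it uses the first record whose User-agent value matches its own name, and falls back to the "*" record otherwise. This is an illustration of the behavior described above, not any particular crawler's actual code.

```python
def parse_records(text):
    """Split robots.txt into records: (user-agent list, disallow list) pairs.
    Comments are stripped; a blank line ends a record."""
    records = []
    agents, disallows = [], []
    for line in text.splitlines() + [""]:      # trailing "" flushes the last record
        line = line.split("#", 1)[0].strip()   # strip comments
        if not line:                           # blank line ends a record
            if agents:
                records.append((agents, disallows))
                agents, disallows = [], []
            continue
        field, _, value = line.partition(":")
        field, value = field.strip().lower(), value.strip()
        if field == "user-agent":
            agents.append(value)
        elif field == "disallow":
            disallows.append(value)
    return records

def record_for(robot_name, records):
    """Return the Disallow list that applies to robot_name, or the '*' fallback."""
    fallback = None
    for agents, disallows in records:
        for agent in agents:
            if agent == "*":
                fallback = disallows
            elif robot_name.lower().startswith(agent.lower()):
                return disallows
    return fallback

robots_txt = """\
# Disallow nothing for G and Y
User-Agent: Googlebot
User-Agent: Slurp
Disallow:

# Disallow everything for all others (who obey)
User-agent: *
Disallow: /
"""

records = parse_records(robots_txt)
print(record_for("Googlebot/2.1", records))   # [''] -> nothing disallowed
print(record_for("SomeOtherBot", records))    # ['/'] -> everything disallowed
```

Note that the blank line between records is exactly what the parser keys on to separate them, which is why primitive parsers can break when it is missing.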
A complex answer to a simple question... :)
Jim
So the "User-agent" directive is using wildcards and case-insensitivity by default? "google" would match "The Googlebot 1.0", for instance?
> Note the blank line at the end. A blank line is required after each record in robots.txt. Some old and primitive robots will break if you leave it out after the last record.
jd, quick question: in the code example you posted, you have a blank line at the end of the file, which I guess would be considered the blank line after the final record. Is that required? If so, I know of a very popular robots.txt validator that should probably be flagging its absence as an error.
> So the "User-agent" directive is using wildcards and case-insensitivity by default? "google" would match "The Googlebot 1.0", for instance?
Again, as described in the Standard for Robot Exclusion [robotstxt.org], robots.txt uses prefix-matching and is case-insensitive. So,
User-agent: google
would match "googlebot/1.1", "GoogleBot/99.9", and "googlegooglebot", but it would not match "The Googlebot 1.0" because "The Googlebot 1.0" does not start with "google", it starts with "Th.."
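Jim's matching rule above can be sketched in a couple of lines (Python, purely illustrative): a User-agent value matches when the robot's name starts with it, compared case-insensitively.

```python
def agent_matches(robot_name, record_agent):
    """Prefix matching, case-insensitive, per the Standard for Robot Exclusion."""
    return robot_name.lower().startswith(record_agent.lower())

print(agent_matches("googlebot/1.1", "google"))      # True
print(agent_matches("GoogleBot/99.9", "google"))     # True
print(agent_matches("googlegooglebot", "google"))    # True
print(agent_matches("The Googlebot 1.0", "google"))  # False -- starts with "Th"
```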
Jim
"Google"
"*Google"
"Google*"
"*Google*"
So in your example googlebot should match "The Googlebot 1.0".