Robots INclusion standard?


mikomido

4:39 pm on Sep 8, 2007 (gmt 0)



Correct me if I'm wrong, but robots.txt only cares about EXCLUDING bots, correct? I want to exclude ALL bots by default and only allow those I want (Google and Yahoo only, really).

Impossible?

jdMorgan

5:11 pm on Sep 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The "exclusion" in the original Standard refers to files, not robots. You can do what you describe fairly easily:

# Disallow nothing for G and Y
User-Agent: Googlebot
User-Agent: Slurp
Disallow:

# Disallow everything for all others (who obey)
User-agent: *
Disallow: /


robots.txt uses prefix-matching on URL-paths, and the Standard treats an empty Disallow value as a special case meaning "disallow nothing." So the first record above blocks nothing, and G and Y will feel free to fetch everything.

In the second record, we disallow all local URL-paths starting with "/". Since all local URL-paths start with a slash, any robots other than G or Y will be instructed not to fetch anything from your server.
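If you want to sanity-check that behaviour, here is a minimal sketch using Python's standard urllib.robotparser module (the example.com URL is just a placeholder):

import urllib.robotparser

# The same two-record robots.txt described above.
robots_txt = """\
User-Agent: Googlebot
User-Agent: Slurp
Disallow:

User-agent: *
Disallow: /
"""

rp = urllib.robotparser.RobotFileParser()
rp.parse(robots_txt.splitlines())

# Googlebot and Slurp may fetch anything; all other obedient robots may not.
print(rp.can_fetch("Googlebot", "http://www.example.com/any/page.html"))      # True
print(rp.can_fetch("Slurp", "http://www.example.com/any/page.html"))          # True
print(rp.can_fetch("SomeOtherBot", "http://www.example.com/any/page.html"))   # False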

Note the blank line at the end of the example. A blank line is required after each record in robots.txt. Some old and primitive robots will break if you leave it out after the last record.

Note also the multiple User-agent lines in the first record. While all of the major search engines recognize this construct, and it is part of the original Standard, some older and more primitive robots won't recognize it. So if you add an old or second-tier robot to this first record, keep an eye on it to be sure it doesn't get confused. If it does, simply give it its own private record, placed above the final catch-all record.
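For example, adding a hypothetical older robot ("OldBot" is just a made-up name here) with its own private record, placed above the final catch-all record, would look like this:

# Disallow nothing for G and Y
User-agent: Googlebot
User-agent: Slurp
Disallow:

# Disallow nothing for the older robot, in its own private record
User-agent: OldBot
Disallow:

# Disallow everything for all others (who obey)
User-agent: *
Disallow: /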

And as hinted at by the note on the final record, robots.txt is a request for good robots not to fetch some or all of your pages. It has no 'enforcement' capability, and many very-dumb or malicious robots won't obey it; many of them won't even fetch it. Further complicating things, robots from many on-line directories won't fetch it either: they figure that since you submitted your page to their directory, they have a right/duty to check that the URL is still valid, which is reasonable. You can easily detect these directory robots because they will only ask for the pages you submitted, and will not 'spider' your whole site the way Googlebot or Slurp do.

A complex answer to a simple question... :)

Jim

mikomido

5:35 pm on Sep 8, 2007 (gmt 0)



Thanks for a great reply.

So the "User-agent" directive is using wildcards and case-insensitivity by default? "google" would match "The Googlebot 1.0", for instance?

Lord Majestic

5:38 pm on Sep 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



User-agent supports only one wildcard, "*", meaning all user-agents. Good bots will take the plain string you specify and try to match it against their own user-agent name, so specifying exact version numbers can actually be counter-productive; just use the main keyword that identifies the bot you want to allow or disallow.

mikomido

5:51 pm on Sep 8, 2007 (gmt 0)



So basically, what you are really saying is that they DO work with wildcards and case-insensitivity?

jdMorgan

6:08 pm on Sep 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Yes.

pageoneresults

6:29 pm on Sep 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



> Note the blank line at the end of the example. A blank line is required after each record in robots.txt. Some old and primitive robots will break if you leave it out after the last record.

jd, quick question: in the code example you posted, you have a blank line at the end of the file, which I guess would be considered a blank line after the final record. Is that required? If so, I know of a very popular robots.txt validator that should probably be flagging a missing one as an error?

jdMorgan

6:41 pm on Sep 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'd flag it as a warning -- I've only ever personally seen one European robot that blew up on that.

Jim

mikomido

6:43 pm on Sep 8, 2007 (gmt 0)



Is "Disallow: *" the same as "Disallow: /" or just nonsense?

Lord Majestic

6:47 pm on Sep 8, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



"Disallow: *" is not a valid directive; "*" can only be used in User-agent lines, and only as the full value, not as part of a pattern like "Goo*".

jdMorgan

12:16 am on Sep 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



More on this question:

> So the "User-agent" directive is using wildcards and case-insensitivity by default? "google" would match "The Googlebot 1.0", for instance?

Again, as described in the Standard for Robot Exclusion [robotstxt.org], robots.txt uses prefix-matching and is case-insensitive. So,

User-agent: google

would match "googlebot/1.1", "GoogleBot/99.9", and "googlegooglebot", but it would not match "The Googlebot 1.0" because "The Googlebot 1.0" does not start with "google", it starts with "Th.."

Jim

mikomido

12:26 am on Sep 9, 2007 (gmt 0)



I see. That is rather odd behaviour. It should match EXACT strings and allow you to put * before and/or after:

"Google"
"*Google"
"Google*"
"*Google*"

Lord Majestic

12:28 am on Sep 9, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Prefix, case-insensitive matching should be used for URL-paths matched against the strings in Disallow: directives. For user-agents it is a substring check that should be used: "The robot must obey the first record in /robots.txt that contains a User-Agent line whose value contains the name token of the robot as a substring. The name comparisons are case-insensitive." (This is from the actual RFC, which is more precise.)

So in your example googlebot should match "The Googlebot 1.0".
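To make the difference concrete, here is a small sketch of the check being described, assuming the practical reading used in this thread (a bot looks for the record's User-agent value inside its own user-agent string, case-insensitively); the helper name and the user-agent strings are made up for illustration:

def record_applies(record_ua_value, robot_user_agent):
    # Case-insensitive substring check: the record applies if its
    # User-agent value appears anywhere in the robot's own user-agent
    # string, with "*" matching every robot.
    if record_ua_value == "*":
        return True
    return record_ua_value.lower() in robot_user_agent.lower()

# Substring matching: "googlebot" is found inside "The Googlebot 1.0".
print(record_applies("googlebot", "The Googlebot 1.0"))       # True

# Strict prefix matching would have rejected the same pair, because
# "The Googlebot 1.0" does not start with "google".
print("The Googlebot 1.0".lower().startswith("google"))       # False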