| 11:48 pm on Jan 21, 2003 (gmt 0)|
FWIW I'm running a robots.txt without quotes around the "user-agent" prefix and everything seems to be obeying me
|User-agent: ia_archiver |
| 12:47 am on Jan 22, 2003 (gmt 0)|
Yeah, that's what I thought too, until I ran my robots.txt thru Search Engine World's Robots.txt Validator [searchengineworld.com].
| 1:11 am on Jan 22, 2003 (gmt 0)|
Using Google and an off-limits directory called "keepout" as an example, the robots.txt record must read exactly as follows, with no changes in punctuation or capitalization:
|User-agent: Googlebot |
|Disallow: /keepout/ |
A blank line is required between this record and any subsequent record specifying a different User-agent. A blank line is also required at the end of the file (by some robots, but not all).
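For instance, a file with two records might look like this (the user-agents are illustrative), with the required blank line between them and an empty Disallow: value meaning nothing is off-limits for that record:
|User-agent: Googlebot |
|Disallow: /keepout/ |
| |
|User-agent: * |
|Disallow: |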
It is sometimes interesting to go to well-known sites like cnn.com and have a look at their robots.txt - this is usually a good way to find a working example.
| 1:35 am on Jan 22, 2003 (gmt 0)|
I understand what you're saying, Jim. However, I only have one Disallow in there and, at the moment, the field is written as User-Agent without any quotation marks around it.
That is where SEW's robots.txt tester finds an error by illustrating the need for "" marks - one on either side of User-Agent.
|# Robots.txt file from http://yadayada.com |
# All robots will spider the domain
| 2:01 am on Jan 22, 2003 (gmt 0)|
That may be because "User-Agent" is incorrect. It must be "User-agent" - No capital "A" in "Agent."
| 2:03 am on Jan 22, 2003 (gmt 0)|
Ok, with some assistance, I got it figured out.
The proverbial white space.
This is what it should be:
|User-Agent: ######### |
as opposed to:
|User-Agent:######### |
I don't understand why the validator reported the problem as a lack of double quotation marks around "User-Agent" if a white space is the culprit.
| 2:06 am on Jan 22, 2003 (gmt 0)|
|That may be because "User-Agent" is incorrect. It must be "User-agent" - No capital "A" in "Agent." |
Not sure about that. The HTTP RFC states it as "User-Agent", and most software treats it the same regardless of letter case.
I would assume robots.txt would rely on the RFC.
| 2:25 am on Jan 22, 2003 (gmt 0)|
He didn't have a space after the colon, so the validator was trying to read the entire line as the field name. That, of course, didn't match.
I too think that case shouldn't matter, but nowhere in the standard is that stated, and all the examples use a lowercase "a" in "agent."
| 2:33 am on Jan 22, 2003 (gmt 0)|
The error message in the Search Engine World robots.txt checker is somewhat ambiguous. It is telling you to not capitalize the "A" in the word "Agent", not that you must quote the name of the user-agent.
The error message reads: ERROR Field names are case sensitive. It should be written as: "User-agent" as written in the standard.
This may not be in compliance with the RFC, but it is what the robots.txt checker is calling for.
Note also that in the "listing" that the checker produces, it will "correct" this capitalization and show "User-agent: psbot" for example, even though the source file says, "User-Agent: psbot". It is not showing you your source file, but rather a modified version of it.
I was able to get a test version of robots.txt to pass with or without white space after User-agent:
White space after User-agent: and Disallow: is optional, but it does improve readability for humans.
The robots.txt checker script is not perfect, but it does help to eliminate most errors.
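Both stylistic variants can also be checked programmatically. As a minimal sketch (the psbot user-agent and example.com URLs are illustrative), Python 3's standard urllib.robotparser lowercases field names and strips whitespace after the colon, so either form parses:

```python
from urllib.robotparser import RobotFileParser

rp = RobotFileParser()
rp.modified()  # mark the parser as initialized; can_fetch() is
               # conservative and denies every URL otherwise
rp.parse([
    "User-Agent:psbot",    # capital A, no space after the colon
    "Disallow: /keepout/",
])

print(rp.can_fetch("psbot", "http://example.com/keepout/page.html"))  # False
print(rp.can_fetch("psbot", "http://example.com/index.html"))         # True
```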
| 2:36 am on Jan 22, 2003 (gmt 0)|
Also, the reference cited above is for HTTP/1.1 headers, not the Standard for Robot Exclusion [robotstxt.org].
<edit> added link</edit>
| 3:11 am on Jan 22, 2003 (gmt 0)|
|He didn't have a space after the colon so the validator was trying to read the entire line. That of course, didn't match. |
Since the space is optional in the standard, it might be better to parse the line(s) using the colon as the expected delimiter, since that is required. Of course, if the colon is missing you're back to the same problem, and the parser has to skip ahead to the next end-of-line.
Either way, the checker has saved me from several typo-induced disasters!
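That colon-delimited approach might look like this in Python (a hypothetical helper, not the validator's actual code):

```python
def parse_robots_line(line):
    """Split one robots.txt line on the first colon.

    The colon is the required delimiter; the space after it is optional,
    so surrounding whitespace is stripped rather than expected. Field
    names are lowercased for case-insensitive comparison.
    """
    line = line.split("#", 1)[0]  # drop trailing comments
    field, sep, value = line.partition(":")
    if not sep:
        return None               # no colon: skip to the next end-of-line
    return field.strip().lower(), value.strip()
```

With this, "User-Agent:psbot" and "User-agent: psbot" both yield the pair ("user-agent", "psbot"), while a line with no colon is skipped.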
| 3:22 am on Jan 22, 2003 (gmt 0)|
Sure is - thanks JD. Never noticed that. Didn't spend enough time studying the problem there. (sorry about that, pendanticist)