Pendanticist.
Using Google and an off-limits directory called "keepout" as an example, the robots.txt record must read exactly as follows, with no changes in punctuation or capitalization:
User-agent: Googlebot
Disallow: /keepout/
A blank line is required between this record and any subsequent record specifying a different User-agent. Some robots, though not all, also require a blank line at the end of the file.
It is sometimes interesting to go to well-known sites like cnn.com and have a look at their robots.txt - this is usually a good way to find a working example.
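For anyone who wants to check a record like this offline, here is a minimal sketch using Python's standard urllib.robotparser module. The second record (User-agent: * with an empty Disallow) and the page paths are assumptions added purely for illustration, to show the blank line separating two records; they are not part of the example above.

```python
from urllib.robotparser import RobotFileParser

# The Googlebot record from the post, followed by a blank line and a
# second, hypothetical record for all other robots (an assumption made
# only to show the record separator).
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /keepout/

User-agent: *
Disallow:
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

# Googlebot is kept out of /keepout/ but may fetch other paths.
googlebot_blocked = not parser.can_fetch("Googlebot", "/keepout/page.html")
googlebot_elsewhere = parser.can_fetch("Googlebot", "/index.html")

# A robot that matches only the catch-all record is not restricted.
other_allowed = parser.can_fetch("SomeOtherBot", "/keepout/page.html")

print(googlebot_blocked, googlebot_elsewhere, other_allowed)
```

This only shows how one particular parser reads the file; individual robots may be stricter or looser about the details.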
HTH,
Jim
That is where SEW's robots.txt tester reports an error, saying that quotation marks ("") are needed, one on either side of User-Agent.
Thusly:
# Robots.txt file from http*//yadayada.com
#
# All robots will spider the domain
User-Agent:#########
Disallow:/path/
<shrug>
Pendanticist.
The proverbial white space.
This is what it should be:
User-Agent: #########
Disallow:/path/
as opposed to:
User-Agent:#########
Disallow:/path/
I don't understand why the results reported the problem as a lack of double quotation marks around "User-Agent" if a missing white space is the culprit.
Pendanticist.
That may be because "User-Agent" is incorrect. It must be "User-agent" - No capital "A" in "Agent."
Not sure about that. The HTTP RFC writes it as User-Agent, and most software treats it the same regardless of letter case.
I would assume robots.txt would rely on the RFC.
[w3.org...]
The error message reads:
ERROR Field names are case sensitive. It should be written as: "User-agent" as written in the standard.
This may not be in compliance with the RFC, but it is what the robots.txt checker is calling for.
Note also that in the "listing" that the checker produces, it will "correct" this capitalization and show "User-agent: psbot" for example, even though the source file says, "User-Agent: psbot". It is not showing you your source file, but rather a modified version of it.
I was able to get a test version of robots.txt to pass with or without white space after User-agent:
White space after User-agent: and Disallow: is optional, but it does improve readability for humans.
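As a rough cross-check on the white space and capitalization points, the sketch below feeds three variants of the same record to Python's standard urllib.robotparser. This is a different parser from the robots.txt checker discussed here, so it only shows how one implementation behaves; the helper name blocks_path and the page path are made up for the example.

```python
from urllib.robotparser import RobotFileParser

def blocks_path(record_lines, agent="psbot", path="/path/page.html"):
    """Return True if the given record keeps `agent` away from `path`."""
    parser = RobotFileParser()
    parser.parse(record_lines)
    return not parser.can_fetch(agent, path)

# Same record written three ways: with a space after the colon,
# without one, and with a capital "A" in the field name.
variants = {
    "space after colon": ["User-agent: psbot", "Disallow: /path/"],
    "no space after colon": ["User-agent:psbot", "Disallow:/path/"],
    "capital A in Agent": ["User-Agent: psbot", "Disallow: /path/"],
}

results = {name: blocks_path(lines) for name, lines in variants.items()}

for name, blocked in results.items():
    print(f"{name}: blocked={blocked}")
```

Python's parser treats all three forms the same way, but that says nothing about how strict any given robot or validator will be, so "User-agent: " with the space remains the safe spelling.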
The robots.txt checker script is not perfect, but it does help to eliminate most errors.
Jim
He didn't have a space after the colon, so the validator was trying to read the entire line. That, of course, didn't match.
Since the space is optional in the standard, it might be better to parse each line using the colon as the expected delimiter, since the colon is required. Of course, if the colon is missing, you're back to the same problem, and the parser can only skip ahead to the next end-of-line.
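A minimal sketch of that idea, assuming nothing about how the actual checker is written: split each line on the first colon, trim any optional white space, and skip to the next line when no colon is found. The function names split_field and parse_fields are hypothetical.

```python
def split_field(line):
    """Split one robots.txt line into (field, value) on the first colon.

    Returns None for blank lines, comment-only lines, and lines with no
    colon at all.
    """
    line = line.split("#", 1)[0].strip()  # drop any trailing comment
    if not line or ":" not in line:
        return None
    field, value = line.split(":", 1)
    # White space around the colon is optional, so trim both sides.
    return field.strip(), value.strip()

def parse_fields(text):
    """Collect (field, value) pairs and the numbers of malformed lines."""
    fields, skipped = [], []
    for number, line in enumerate(text.splitlines(), start=1):
        parsed = split_field(line)
        if parsed is not None:
            fields.append(parsed)
        elif line.split("#", 1)[0].strip():
            # Non-empty content but no colon: note it and move on to
            # the next end-of-line.
            skipped.append(number)
    return fields, skipped

sample = "User-agent:psbot\nDisallow /path/\nDisallow:/path/\n"
fields, skipped = parse_fields(sample)
print(fields)
print(skipped)
```

With this approach "User-agent:psbot" and "User-agent: psbot" come out identical, and only the line that is missing its colon gets flagged.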
Either way, the checker has saved me from several typo-induced disasters!
Thanks,
Jim