"User-Agent"? - Sitemaps, Meta Data, and robots.txt forum at WebmasterWorld - WebmasterWorld

Forum Moderators: goodroi

"User-Agent"?

...or, just User-Agent ?

pendanticist

11:39 pm on Jan 21, 2003 (gmt 0)

WebmasterWorld Senior Member

10+ Year Member

Which is correct in your robots.txt?

Does it require the "" at the beginning and end of the term User-Agent?

Thusly? - "User-Agent"

or thusly? User-Agent

Pendanticist.

Dreamquick

11:48 pm on Jan 21, 2003 (gmt 0)

FWIW, I'm running a robots.txt without quotes around the "User-agent" prefix, and everything seems to be obeying it, e.g.:

User-agent: ia_archiver
Disallow: /

- Tony

pendanticist

12:47 am on Jan 22, 2003 (gmt 0)

Yeah, that's what I thought too, until I ran my robots.txt thru Search Engine World's Robots.txt Validator [searchengineworld.com].

Pendanticist.

jdMorgan

1:11 am on Jan 22, 2003 (gmt 0)

Pendanticist,

Using Google and an off-limits directory called "keepout" as an example, the robots.txt record must be precisely, and with no changes in punctuation or capitalization:

User-agent: Googlebot
Disallow: /keepout/

A blank line is required between this record and any subsequent record specifying a different User-agent. A blank line is also required at the end of the file (by some robots, but not all).
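For anyone who wants to check a record like the one above without waiting for a crawler, here is a minimal sketch using Python's standard `urllib.robotparser`. It only shows how that one parser reads the file, not how Googlebot or any other robot actually behaves. The second record, the `OtherBot` name, and the `example.com` URLs are made up for illustration.

```python
from urllib.robotparser import RobotFileParser

# Two records separated by a blank line.
# "OtherBot" and the example.com URLs below are hypothetical.
ROBOTS_TXT = """\
User-agent: Googlebot
Disallow: /keepout/

User-agent: OtherBot
Disallow: /
"""


def build_parser(text):
    """Feed robots.txt text to the stdlib parser, one line at a time."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# Googlebot is kept out of /keepout/ but may fetch everything else.
googlebot_blocked = not parser.can_fetch(
    "Googlebot", "http://example.com/keepout/page.html")
googlebot_allowed = parser.can_fetch(
    "Googlebot", "http://example.com/index.html")

# The second record shuts OtherBot out of the whole site.
otherbot_blocked = not parser.can_fetch(
    "OtherBot", "http://example.com/index.html")
```

If the three flags come back true, the blank line has split the file into two separate records as intended.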

It is sometimes interesting to go to well-known sites like cnn.com and have a look at their robots.txt - this is usually a good way to find a working example.

HTH,
Jim

pendanticist

1:35 am on Jan 22, 2003 (gmt 0)

I understand what you're saying, Jim. However, I only have one Disallow in there and, at the moment, it is not "User-Agent" but just User-Agent.

That is where SEW's robots.txt tester finds an error by illustrating the need for "" marks - one on either side of User-Agent.

Thusly:

# Robots.txt file from http*//yadayada.com
#
# All robots will spider the domain
User-Agent:#########
Disallow:/path/

<shrug>

Pendanticist.

jdMorgan

2:01 am on Jan 22, 2003 (gmt 0)

That may be because "User-Agent" is incorrect. It must be "User-agent" - No capital "A" in "Agent."

Jim

pendanticist

2:03 am on Jan 22, 2003 (gmt 0)

Ok, with some assistance, I got it figured out.

The proverbial white space.

This is what it should be:

User-Agent: #########
Disallow:/path/

as opposed to:

User-Agent:#########
Disallow:/path/

I don't understand why the results reported the problem as a lack of double quotation marks around "User-Agent" if white space is the culprit.

Pendanticist.

bcc1234

2:06 am on Jan 22, 2003 (gmt 0)

> That may be because "User-Agent" is incorrect. It must be "User-agent" - No capital "A" in "Agent."

Not sure about that. The HTTP RFC writes it as User-Agent, and most software treats it the same regardless of letter case.
I would assume robots.txt relies on the RFC.

Brett_Tabke

2:25 am on Jan 22, 2003 (gmt 0)

WebmasterWorld Administrator

He didn't have a space after the colon, so the validator was trying to read the entire line. That, of course, didn't match.

I too think that case shouldn't matter, but nowhere in the standard is that stated, and all the examples use a lowercase "a" for "agent".

jdMorgan

2:33 am on Jan 22, 2003 (gmt 0)

The error message in the Search Engine World robots.txt checker is somewhat ambiguous. It is telling you to not capitalize the "A" in the word "Agent", not that you must quote the name of the user-agent.

The error message reads:

ERROR Field names are case sensitive. It should be written as: "User-agent" as written in the standard.

This may not be in compliance with the RFC, but it is what the robots.txt checker is calling for.

Note also that in the "listing" that the checker produces, it will "correct" this capitalization and show "User-agent: psbot" for example, even though the source file says, "User-Agent: psbot". It is not showing you your source file, but rather a modified version of it.

I was able to get a test version of robots.txt to pass with or without white space after "User-agent:".

White space after User-agent: and Disallow: is optional, but it does improve readability for humans.
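As a rough cross-check on the white space and capitalization questions, here is a small sketch that runs three variants of the same record through Python's standard `urllib.robotparser`. This is a different parser from the Search Engine World checker, so it only shows how one lenient implementation behaves. It also assumes field names are case insensitive, which is my reading of the robotstxt.org text. The `is_blocked` helper and the `example.com` URL are made up for illustration.

```python
from urllib.robotparser import RobotFileParser


def is_blocked(robots_lines, agent, url):
    """Return True if this parser says `agent` may not fetch `url`."""
    parser = RobotFileParser()
    parser.parse(robots_lines)
    return not parser.can_fetch(agent, url)


# Hypothetical URL inside the disallowed path.
URL = "http://example.com/path/page.html"

# The same record written three ways.
variants = {
    "standard": ["User-agent: psbot", "Disallow: /path/"],
    "capital A": ["User-Agent: psbot", "Disallow: /path/"],
    "no space": ["User-agent:psbot", "Disallow:/path/"],
}

results = {
    name: is_blocked(lines, "psbot", URL)
    for name, lines in variants.items()
}
```

All three variants should come back blocked with this parser. A stricter robot or validator may still object to the capital "A", so sticking with "User-agent: " followed by a space is the safe choice.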

The robots.txt checker script is not perfect, but it does help to eliminate most errors.

Jim

jdMorgan

2:36 am on Jan 22, 2003 (gmt 0)

Also, the reference cited above is for HTTP/1.1 headers, not the Standard for Robot Exclusion [robotstxt.org].

Jim
<edit> added link</edit>

jdMorgan

3:11 am on Jan 22, 2003 (gmt 0)

Brett,

> He didn't have a space after the colon so the validator was trying to read the entire line. That of course, didn't match.

Since the space is optional in the standard, it might be better to parse each line using the colon as the expected delimiter, since the colon is required. Of course, if the colon is missing, you're back to the same problem, and all the parser can do is skip ahead to the next end-of-line.
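To make the colon-as-delimiter idea concrete, here is a minimal sketch in Python. It is not how the checker is written. `parse_field` is a made-up name, and lowercasing the field name and stripping "#" comments are my own assumptions about what a tolerant parser would do.

```python
def parse_field(line):
    """Split one robots.txt line into (field, value).

    Returns None for blank or comment-only lines and raises ValueError
    when the required colon is missing.
    """
    # Drop any trailing comment, then the surrounding white space.
    content = line.split("#", 1)[0].strip()
    if not content:
        return None

    # Split on the first colon only, so the value may contain colons.
    field, colon, value = content.partition(":")
    if not colon:
        raise ValueError("missing ':' in line: %r" % line)

    # Assumption: field names are compared case-insensitively.
    # Optional white space around the value is discarded.
    return field.strip().lower(), value.strip()
```

With this approach "User-Agent:psbot" and "User-agent: psbot" both come out as the same pair, and a line with no colon is reported as an error rather than matched against the whole line.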

Either way, the checker has saved me from several typo-induced disasters!

Thanks,
Jim

Brett_Tabke

3:22 am on Jan 22, 2003 (gmt 0)

><optionalspace>

Sure is - thanks JD. Never noticed that. Didn't spend enough time studying the problem there. (sorry about that pendanticist)