Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

    
"User-Agent"?
...or, just User-Agent ?
pendanticist · msg:1529061 · 11:39 pm on Jan 21, 2003 (gmt 0)

Which is correct in your robots.txt?

Does it require the "" at the beginning and end of the term User-Agent?

Thusly? - "User-Agent"

or thusly? User-Agent

Pendanticist.

 

Dreamquick · msg:1529062 · 11:48 pm on Jan 21, 2003 (gmt 0)

FWIW, I'm running a robots.txt without quotes around the "user-agent" prefix, and everything seems to be obeying me. E.g.:

User-agent: ia_archiver
Disallow: /

- Tony

pendanticist · msg:1529063 · 12:47 am on Jan 22, 2003 (gmt 0)

Yeah, that's what I thought too, until I ran my robots.txt thru Search Engine World's Robots.txt Validator [searchengineworld.com].

Pendanticist.

jdMorgan · msg:1529064 · 1:11 am on Jan 22, 2003 (gmt 0)

Pendanticist,

Using Google and an off-limits directory called "keepout" as an example, the robots.txt record must read precisely as follows, with no changes in punctuation or capitalization:

User-agent: Googlebot
Disallow: /keepout/

A blank line is required between this record and any subsequent record specifying a different User-agent. A blank line is also required at the end of the file (by some robots, but not all).
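For instance (a sketch using one hypothetical directory name alongside the examples in this thread), a file with records for two different robots would separate them with a blank line:

```
User-agent: Googlebot
Disallow: /keepout/

User-agent: ia_archiver
Disallow: /
```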

It is sometimes interesting to go to well-known sites like cnn.com and have a look at their robots.txt - this is usually a good way to find a working example.

HTH,
Jim

pendanticist · msg:1529065 · 1:35 am on Jan 22, 2003 (gmt 0)

I understand what you're saying, Jim. However, I only have one Disallow in there and, at the moment, it is not "User-Agent" but just User-Agent.

That is where SEW's robots.txt tester finds an error by illustrating the need for "" marks - one on either side of User-Agent.

Thusly:

# Robots.txt file from http://yadayada.com
#
# All robots will spider the domain

User-Agent:#########
Disallow:/path/

<shrug>

Pendanticist.

jdMorgan · msg:1529066 · 2:01 am on Jan 22, 2003 (gmt 0)

That may be because "User-Agent" is incorrect. It must be "User-agent" - no capital "A" in "Agent."

Jim

pendanticist · msg:1529067 · 2:03 am on Jan 22, 2003 (gmt 0)

Ok, with some assistance, I got it figured out.

The proverbial white space.

This is what it should be:

User-Agent: #########
Disallow:/path/

as opposed to:

User-Agent:#########
Disallow:/path/

I don't understand why the results noted the problem being the lack of double quotation marks surrounding "User-Agent" if a white space is the culprit.

Pendanticist.

bcc1234 · msg:1529068 · 2:06 am on Jan 22, 2003 (gmt 0)

That may be because "User-Agent" is incorrect. It must be "User-agent" - No capital "A" in "Agent."

Not sure about that. The HTTP RFC spells it User-Agent, and most software treats field names the same regardless of letter case. I would assume robots.txt would rely on the RFC.

[w3.org...]
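The HTTP convention bcc1234 is describing can be shown with a toy sketch (illustrative Python, not any particular robot's or validator's code): field names compare case-insensitively, so a parser following the HTTP rule accepts any capitalization.

```python
# HTTP header field names are case-insensitive per the RFC, so a parser
# following that convention treats all of these spellings as the same field.
def same_field(a, b):
    """Compare two field names the way HTTP does: ignoring case."""
    return a.lower() == b.lower()

for variant in ("User-Agent", "User-agent", "USER-AGENT"):
    assert same_field(variant, "user-agent")
print("all variants match case-insensitively")
```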

Brett_Tabke · msg:1529069 · 2:25 am on Jan 22, 2003 (gmt 0)

He didn't have a space after the colon, so the validator was trying to read the entire line as the field name. That, of course, didn't match.

I too think that case shouldn't matter, but nowhere in the standard is that stated, and all the examples use a lowercase "a" in "agent".

jdMorgan · msg:1529070 · 2:33 am on Jan 22, 2003 (gmt 0)

The error message in the Search Engine World robots.txt checker is somewhat ambiguous. It is telling you not to capitalize the "A" in the word "Agent", not that you must quote the field name.

The error message reads:
ERROR Field names are case sensitive. It should be written as: "User-agent" as written in the standard.

This may not be in compliance with the RFC, but it is what the robots.txt checker is calling for.

Note also that in the "listing" that the checker produces, it will "correct" this capitalization and show "User-agent: psbot" for example, even though the source file says, "User-Agent: psbot". It is not showing you your source file, but rather a modified version of it.

I was able to get a test version of robots.txt to pass with or without white space after User-agent:

White space after User-agent: and Disallow: is optional, but it does improve readability for humans.

The robots.txt checker script is not perfect, but it does help to eliminate most errors.

Jim

jdMorgan · msg:1529071 · 2:36 am on Jan 22, 2003 (gmt 0)

Also, the reference cited above is for HTTP/1.1 headers, not the Standard for Robot Exclusion [robotstxt.org].

Jim
<edit> added link</edit>

jdMorgan · msg:1529072 · 3:11 am on Jan 22, 2003 (gmt 0)

Brett,

He didn't have a space after the colon so the validator was trying to read the entire line. That of course, didn't match.

Since the space is optional in the standard, it might be better to parse the line(s) using the colon as the expected delimiter, since the colon is required. Of course, if the colon is missing, you're back to the same problem - the parser can only skip ahead to the next end-of-line.
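That colon-delimited approach can be sketched as follows (illustrative Python only - this is not the Search Engine World validator's actual code). Splitting on the first colon makes the space after "User-agent:" optional, and lowercasing the field name sidesteps the capitalization question:

```python
# Minimal, tolerant robots.txt line parser: split on the first colon
# (the required delimiter), so whitespace after it becomes optional.
def parse_line(line):
    """Return (field, value) for a directive line, or None for
    blank lines and comments."""
    line = line.split('#', 1)[0].strip()  # drop comments and edge whitespace
    if not line or ':' not in line:
        return None                       # blank, comment, or malformed line
    field, value = line.split(':', 1)     # colon is the required delimiter
    return field.strip().lower(), value.strip()

print(parse_line("User-Agent:ia_archiver"))   # ('user-agent', 'ia_archiver')
print(parse_line("User-agent: ia_archiver"))  # ('user-agent', 'ia_archiver')
```

Both spellings - with and without the space - come out identical, which is the behavior the standard's <optionalspace> token calls for.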

Either way, the checker has saved me from several typo-induced disasters!

Thanks,
Jim

Brett_Tabke · msg:1529073 · 3:22 am on Jan 22, 2003 (gmt 0)

><optionalspace>

Sure is - thanks JD. Never noticed that. Didn't spend enough time studying the problem there. (sorry about that pendanticist)

© Webmaster World 1996-2014 all rights reserved