Home / Forums Index / Search Engines / Sitemaps, Meta Data, and robots.txt
Forum Library, Charter, Moderators: goodroi

Sitemaps, Meta Data, and robots.txt Forum

what is the User-Agent?

WebmasterWorld Senior Member lucy24 (WebmasterWorld Top Contributor of All Time, Top Contributors Of The Month)

Msg#: 4557887 posted 1:06 am on Mar 24, 2013 (gmt 0)

That was a serious question. Next door in Apache I would not need to ask because the answer is straightforward: the "User Agent" is any substring of the full user-agent string, with further constraints --anchors, case-sensitivity, lookarounds etc. --at the site owner's discretion. If you meet an unanchored rule that says "like Firefox 3.5", it's no use arguing that your UA string says "NOT like Firefox 3.5".
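For anyone who hasn't met this on the Apache side, here is a rough illustration of why the unanchored rule bites. It uses Python's re module as a stand-in for Apache's regex matching, and both UA strings are made up for the example, so treat it as a sketch rather than real server config.

```python
import re

# Unanchored pattern: it may match anywhere in the string, like an
# Apache rule written without ^ or $ anchors. The dot is escaped so it
# only matches a literal ".".
RULE = re.compile(r"like Firefox 3\.5")

def is_blocked(user_agent: str) -> bool:
    """Return True if the rule matches any substring of the UA string."""
    return RULE.search(user_agent) is not None

# Hypothetical UA strings, invented for this sketch.
plain = "ExampleBot/1.0 (like Firefox 3.5)"
denial = "ExampleBot/1.0 (NOT like Firefox 3.5)"

print(is_blocked(plain))
print(is_blocked(denial))
```

Both calls print True, because the text "like Firefox 3.5" is a substring of both strings. Keeping the second one out would take an anchor or a lookaround.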

But what about robots.txt? It's all voluntary, so the robot is welcome to say "Oh, sorry, didn't know you meant me, nobody ever calls me Rob".

Assuming for the sake of discussion that your visiting robot wants to follow robots.txt: What's it called? They don't wear name tags, so all you've got is the full UA string.

I found this useful page [developers.google.com] at Google. It's really https, but it won't link that way, even in "url=" tags. Other neighboring pages supply additional information such as
The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted.

(psst! Helen! weren't you just asking about this?) and
The guidelines set forth in this document are followed by all automated crawlers at Google. When an agent accesses URLs on behalf of a user (for example, for translation, manually subscribed feeds, malware analysis, etc), these guidelines do not need to apply.

(Nice to see that in print: don't remember meeting it before.)
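To make the host, protocol and port point concrete, here is a minimal sketch of how a crawler might work out which robots.txt governs a given URL. The function name robots_txt_url and the example.com addresses are my own placeholders, not anything from Google's pages.

```python
from urllib.parse import urlsplit

def robots_txt_url(page_url: str) -> str:
    """Build the robots.txt URL for the scheme, host and port of page_url."""
    parts = urlsplit(page_url)
    # netloc keeps the host and any explicit port (and userinfo, if the
    # URL has any), so each scheme/host/port combination maps to its
    # own robots.txt.
    return f"{parts.scheme}://{parts.netloc}/robots.txt"

# example.com is a placeholder host.
print(robots_txt_url("http://example.com/some/page.html"))
print(robots_txt_url("https://example.com/some/page.html"))
print(robots_txt_url("https://example.com:8443/some/page.html"))
```

The three calls give three different robots.txt locations, which is the quoted rule in practice: a file served over http says nothing about the https version of the same host, or about another port.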

There's a similar list at Yandex [help.yandex.com].

But I'm ### if I can find equivalent information for No.-2-We-Try-Harder. The URL given in their UA string leads you around in circles, ending up with a down-the-drain sort of spiral. If you say "msnbot" will it cover msnbot-media? Will it cover the bingbot on general "we know what you mean" principles-- or nobody? What if you spell it MSNbot?

I tossed that out as an example, but the underlying question applies to any well-intentioned robot.



WebmasterWorld Administrator phranque (WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month)

Msg#: 4557887 posted 12:33 pm on Mar 24, 2013 (gmt 0)

Which Crawlers Does Bing Use? - Bing Webmaster Tools:
http://www.bing.com/webmaster/help/which-crawlers-does-bing-use-8c184ec0 [bing.com]

most well-behaved crawlers' user agent strings include a url referring to the crawler's information page.

in general, the fallback is the robots exclusion standard.
http://www.robotstxt.org/orig.html [robotstxt.org]:
The value of this field is the name of the robot the record is describing access policy for.
If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.
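The "case insensitive substring match of the name without version information" quoted above could be sketched roughly like this. The function names are hypothetical, and the direction of the match (the record's value only has to appear somewhere inside the robot's own name) is one reading of the recommendation, not something the standard spells out. The version numbers in the usage lines are illustrative only.

```python
def name_without_version(robot_name: str) -> str:
    """Drop version information, e.g. 'msnbot-media/1.1' -> 'msnbot-media'."""
    return robot_name.split("/")[0].strip()

def record_applies(field_value: str, robot_name: str) -> bool:
    """Liberal match of a User-agent field value against a robot's name."""
    token = field_value.strip().lower()
    # Simplification: '*' is treated as matching everyone. A full parser
    # would use it only when no other record matched.
    if token == "*":
        return True
    name = name_without_version(robot_name).lower()
    # Assumption: the record's token only has to appear somewhere in
    # the robot's name; an empty token matches nothing.
    return token != "" and token in name

print(record_applies("msnbot", "msnbot-media/1.1"))
print(record_applies("MSNbot", "msnbot/2.0"))
print(record_applies("msnbot", "bingbot/2.0"))
```

Under this reading the calls print True, True and False: "msnbot" would cover msnbot-media, the spelling MSNbot would make no difference, and bingbot would not be covered. Whether any particular crawler actually matches this way is up to that crawler, so its own documentation still has the last word.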

for a sample list of robots (not up to date, but whatev) check out the Robots Database:
http://www.robotstxt.org/db.html [robotstxt.org]


WebmasterWorld Senior Member lucy24 (WebmasterWorld Top Contributor of All Time, Top Contributors Of The Month)

Msg#: 4557887 posted 12:58 pm on Mar 24, 2013 (gmt 0)

Dang! How did I miss that? I was all over robotstxt dot org, of course. Even found their, ahem, slightly elderly list ;) (Inktomi Slurp ?! Jeeves? Lycos?)

I feel left out. I have never been visited by Die Blinde Kuh, nor yet by "Fish search", FunnelWeb, "I, Robot", Kilroy or ParaSite. Some good names there!


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved