|what is the User-Agent?|
That was a serious question. Next door in Apache I would not need to ask, because there the answer is straightforward: the "User Agent" is any substring of the full user-agent string, with further constraints (anchors, case-sensitivity, lookarounds and so on) at the site owner's discretion. If you meet an unanchored rule that says "like Firefox 3.5", it's no use arguing that your UA string says "NOT like Firefox 3.5".
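The Apache-side point can be sketched in a few lines. This is just an illustration of unanchored substring matching; the pattern and UA string here are made up, not taken from any real configuration:

```python
import re

# An unanchored pattern (as in Apache's SetEnvIf or RewriteCond against
# the User-Agent header) matches anywhere inside the full UA string.
rule = re.compile(r'Firefox 3\.5')  # no ^ or $ anchors

# A UA that loudly denies being Firefox still contains the substring:
ua = 'SomeBot/1.0 (NOT like Firefox 3.5)'
print(bool(rule.search(ua)))  # True -- the denial doesn't help
```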
But what about robots.txt? It's all voluntary, so the robot is welcome to say "Oh, sorry, didn't know you meant me, nobody ever calls me Rob".
Assuming for the sake of discussion that your visiting robot wants to follow robots.txt: What's it called? They don't wear name tags, so all you've got is the full UA string.
I found this useful page [developers.google.com] at Google. It's really https, but it won't link that way, even in "url=" tags. Other neighboring pages supply additional information, such as
|The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted. |
(psst! Helen! weren't you just asking about this?) and
|The guidelines set forth in this document are followed by all automated crawlers at Google. When an agent accesses URLs on behalf of a user (for example, for translation, manually subscribed feeds, malware analysis, etc), these guidelines do not need to apply. |
(Nice to see that in print: don't remember meeting it before.)
There's a similar list at Yandex [help.yandex.com].
But I'm ### if I can find equivalent information for No.-2-We-Try-Harder. The URL given in their UA string leads you around in circles, ending in a down-the-drain sort of spiral. If you say "msnbot", will it cover msnbot-media? Will it cover the bingbot on general "we know what you mean" principles, or nobody? What if you spell it MSNbot?
I tossed that out as an example, but the underlying question applies to any well-intentioned robot.
Which Crawlers Does Bing Use? - Bing Webmaster Tools:
Most well-behaved crawlers' user-agent strings include a URL referring to the crawler's information page.
In general, the fallback is the Robots Exclusion Standard:
|The value of this field is the name of the robot the record is describing access policy for.
If more than one User-agent field is present the record describes an identical access policy for more than one robot. At least one field needs to be present per record.
The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.
If the value is '*', the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.|
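The quoted rules (case-insensitive substring match, '*' as the default record) can be sketched in a few lines. This is a minimal illustration of the standard's recommended matching, not any real crawler's implementation; the record names are hypothetical:

```python
def pick_record(robot_name, records):
    """Pick the robots.txt record for a robot, per the quoted rules:
    a User-agent token matches if it is a case-insensitive substring
    of the robot's name; the '*' record is the fallback."""
    name = robot_name.lower()
    for token, record in records.items():
        if token != '*' and token.lower() in name:
            return record
    return records.get('*')  # default record, if any

# Hypothetical records, for illustration only.
records = {
    'msnbot': 'msnbot record',
    '*': 'default record',
}

# "msnbot" as a substring also catches "msnbot-media"...
print(pick_record('msnbot-media', records))  # msnbot record
# ...but not "bingbot", which falls through to '*'.
print(pick_record('bingbot', records))       # default record
# And case doesn't matter, so "MSNbot" still matches.
print(pick_record('MSNbot/2.0b', records))   # msnbot record
```

So under a liberal reading of the standard, "msnbot" would cover msnbot-media (substring) and MSNbot (case-insensitive), but bingbot would fall through to the '*' record. Whether Bing's crawlers actually read it that way is exactly the question.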
For a sample list of robots (not up to date, but whatev) check out the Robots Database:
Dang! How did I miss that? I was all over robotstxt dot org, of course. Even found their, ahem, slightly elderly list ;) (Inktomi Slurp ?! Jeeves? Lycos?)
I feel left out. I have never been visited by Die Blinde Kuh, nor yet by "Fish search", FunnelWeb, "I, Robot", Kilroy or ParaSite. Some good names there!