lucy24 - 1:06 am on Mar 24, 2013 (gmt 0)
That was a serious question. Next door in Apache I would not need to ask, because the answer is straightforward: the "User Agent" is any substring of the full user-agent string, with further constraints (anchors, case-sensitivity, lookarounds etc.) at the site owner's discretion. If you meet an unanchored rule that says "like Firefox 3.5", it's no use arguing that your UA string says "NOT like Firefox 3.5".
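To make the Apache side concrete, here is a rough sketch of unanchored versus anchored matching. It uses Python's re module as a stand-in for Apache's regex engine, and both the helper name rule_matches and the sample UA string are made up for illustration.

```python
import re

def rule_matches(pattern, ua, ignore_case=False):
    """Return True if the pattern is found anywhere in the UA string."""
    flags = re.IGNORECASE if ignore_case else 0
    # re.search scans the whole string, so an unanchored pattern
    # behaves like a substring test.
    return re.search(pattern, ua, flags) is not None

# Hypothetical UA string, invented for this example.
ua = "ExampleBot/1.0 (NOT like Firefox 3.5)"

unanchored = rule_matches(r"like Firefox 3\.5", ua)
anchored = rule_matches(r"^like Firefox 3\.5", ua)
lowercase = rule_matches(r"like firefox 3\.5", ua, ignore_case=True)
```

The unanchored rule catches the string in spite of the "NOT" in front of it; only the anchor keeps it out.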
But what about robots.txt? It's all voluntary, so the robot is welcome to say "Oh, sorry, didn't know you meant me, nobody ever calls me Rob".
Assuming for the sake of discussion that your visiting robot wants to follow robots.txt: What's it called? They don't wear name tags, so all you've got is the full UA string.
I found this useful page [developers.google.com] at Google. It's really https, but it won't link that way, even in "url=" tags. Other neighboring pages supply additional information, such as
The directives listed in the robots.txt file apply only to the host, protocol and port number where the file is hosted.
(psst! Helen! weren't you just asking about this?) and
The guidelines set forth in this document are followed by all automated crawlers at Google. When an agent accesses URLs on behalf of a user (for example, for translation, manually subscribed feeds, malware analysis, etc), these guidelines do not need to apply.
(Nice to see that in print: don't remember meeting it before.)
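The host, protocol and port rule quoted above can be sketched roughly like this. It is only an illustration of the scoping idea, not anybody's actual crawler code; the function names and the example.com URLs are my own, and I'm assuming the usual default ports of 80 for http and 443 for https.

```python
from urllib.parse import urlsplit

# Assumed default ports when the URL does not name one.
DEFAULT_PORTS = {"http": 80, "https": 443}

def robots_scope(url):
    """Return the (protocol, host, port) triple a robots.txt would apply to."""
    parts = urlsplit(url)
    port = parts.port or DEFAULT_PORTS.get(parts.scheme)
    return (parts.scheme, parts.hostname, port)

def same_robots_scope(a, b):
    """True if both URLs fall under the same robots.txt file."""
    return robots_scope(a) == robots_scope(b)
```

Under this reading, http://example.com/a and http://example.com:80/b share one robots.txt, while the https version, a www. subdomain, or port 8080 would each need their own.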
There's a similar list at Yandex [help.yandex.com].
But I'm ### if I can find equivalent information for No.-2-We-Try-Harder. The URL given in their UA string leads you around in circles, ending up with a down-the-drain sort of spiral. If you say "msnbot" will it cover msnbot-media? Will it cover the bingbot on general "we know what you mean" principles, or will it cover nobody at all? What if you spell it MSNbot?
I tossed that out as an example, but the underlying question applies to any well-intentioned robot.
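For what it's worth, the original robots.txt document recommends a case-insensitive substring match on the robot's name without version information. Here is a minimal sketch of that convention, assuming the robot already knows what it calls itself. The function name record_applies is hypothetical, and nothing here says that any particular search engine actually behaves this way.

```python
def record_applies(record_agent, robot_name):
    """Return True if a robots.txt User-agent value covers the named robot.

    Assumes the convention of a case-insensitive substring match;
    a real robot is free to do something else.
    """
    if record_agent.strip() == "*":
        return True  # the wildcard record covers every robot
    return record_agent.strip().lower() in robot_name.lower()

covers_media = record_applies("msnbot", "msnbot-media")
covers_bingbot = record_applies("msnbot", "bingbot")
covers_mixed_case = record_applies("MSNbot", "msnbot")
```

Under that convention "msnbot" would cover msnbot-media, "MSNbot" would work as well as "msnbot", and the bingbot would not be covered. Whether the real robots agree is exactly the open question.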