The list below is a very rough guide to what I allow but some bots are limited to certain versions (eg no media bots). Please don't ask what they all are - it's been a while and I'm no longer sure myself; some may not even exist now! Some were added at customers' requests (eg tripadvisor, linkedin). If anyone has any adverse experience or knowledge of them, I'd appreciate knowing.
Note: if you blacklist the common term "spider", you'll need to add a line (or lines) excluding Baiduspider from that denial. I don't recall whether Yandex uses the "crawler" term; if so, then you'll need more exclusion lines for that abused term.
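For anyone who wants to see the idea rather than read it, here is a rough Python sketch of "deny the broad term, but carve out an exception". The pattern names and the function `is_denied` are made up for illustration; in practice this sort of rule usually lives in the server config, and the exemption list would be whatever you actually want to let through.

```python
import re

# Broad term to deny, plus an exception that should survive the denial.
# Both patterns are illustrative, not anyone's production config.
DENY_TERM = re.compile(r"spider", re.IGNORECASE)
EXEMPT = re.compile(r"Baiduspider", re.IGNORECASE)


def is_denied(user_agent: str) -> bool:
    """Return True if the UA contains the denied term and is not exempt."""
    # Check the exemption first so the broad match cannot catch it.
    if EXEMPT.search(user_agent):
        return False
    return bool(DENY_TERM.search(user_agent))
```

The order matters: if the broad "spider" test ran first, Baiduspider would never reach its exemption. Matching on the UA alone is of course trivially spoofable, so treat it as a first filter only.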
Absolutely. Only their search index is supplied by Bing.
Is there a vanilla YahooBot? During the month that I tracked all my robots, all I got was:
--a single visit from Yahoo! Slurp which slurped up a single page with all images
--recurring visits from YahooCacheSystem, picking up the front page + favicon
Not much to index there. I have to go back to December to find it asking for anything else-- including, ahem, robots.txt ;) Way back in November it slurped up a different picture-intensive page. But that's it.
Just picked up on HuaweiSymantecSpider from another thread hereabouts. I originally had it allowed but put it in the block list as well (which pre-empts the Allow). So - shouldn't be in my list above.
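As a rough illustration of a block list pre-empting an allow list, here is a minimal Python sketch. The two lists and the function `is_allowed` are hypothetical stand-ins, not the actual setup described above, and the default-deny at the end is an assumption about how a whitelist would normally behave.

```python
# Hypothetical lists: a bot named in both ends up blocked.
ALLOW = ["Baiduspider", "HuaweiSymantecSpider", "bingbot"]
BLOCK = ["HuaweiSymantecSpider"]


def is_allowed(user_agent: str) -> bool:
    """Block list is consulted first, so it overrides any Allow entry."""
    ua = user_agent.lower()
    if any(term.lower() in ua for term in BLOCK):
        return False
    # Assumed default: anything not explicitly allowed is refused.
    return any(term.lower() in ua for term in ALLOW)
```

With these lists, a UA containing HuaweiSymantecSpider is refused even though it still appears under Allow, which is why a stale Allow entry does no harm beyond being misleading.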
Keyplr - I almost never allow nutch but this one was an exception (the other was an old version of Yell). Cabot is the Amfibi SE's bot. To be fair, although Amfibi still exists I have no idea whether it a) still crawls and b) still includes nutch in the UA. The twitter and facebook bots are VERY limited: to start with, they have to be IP-based and NOTHING from AWS. They exist only because a few customers have requested them.
Edit by dstiles: Just noticed at the foot of Amfibi's bot page: "Powered by Nutch". Whether it still includes that in the UA isn't clear.
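The IP restriction on the twitter and facebook bots mentioned above could be sketched roughly like this in Python. Everything specific here is an assumption: the networks are reserved documentation ranges used as placeholders (not real Twitter, Facebook or AWS ranges), and `social_bot_allowed` is a made-up name. Real ranges would have to come from the operators' own published lists.

```python
import ipaddress

# Placeholder networks only (RFC 5737 documentation ranges).
BOT_NETWORKS = {
    "twitterbot": [ipaddress.ip_network("192.0.2.0/24")],
    "facebookexternalhit": [ipaddress.ip_network("198.51.100.0/24")],
}
# Stand-in for hosting ranges (e.g. AWS) that are refused outright.
HOSTING_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]


def social_bot_allowed(user_agent: str, remote_ip: str) -> bool:
    """Allow a social bot only if its IP is in that bot's known networks."""
    ip = ipaddress.ip_address(remote_ip)
    # Hosting ranges lose regardless of what the UA claims.
    if any(ip in net for net in HOSTING_NETWORKS):
        return False
    ua = user_agent.lower()
    for token, networks in BOT_NETWORKS.items():
        if token in ua:
            return any(ip in net for net in networks)
    return False
```

The point of the design is that the UA string only selects which IP list to check; it never grants access by itself.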