Might anybody have a list of User-Agent exceptions?
Is it just Google and Bing?
dstiles
9:31 pm on Feb 15, 2012 (gmt 0)
I assume you mean bot UAs.
The list below is a very rough guide to what I allow but some bots are limited to certain versions (eg no media bots). Please don't ask what they all are - it's been a while and I'm no longer sure myself; some may not even exist now! Some were added at customers' requests (eg tripadvisor, linkedin). If anyone has any adverse experience or knowledge of them, I'd appreciate knowing.
Note: if you black-list the common term "spider", you'll need to add a line (or lines) excluding Baiduspider from that denial. I don't recall whether Yandex uses the "crawler" term; if so, then more exclusion lines are needed for that abused term.
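A minimal sketch of the idea in that note: deny any UA containing an abused generic term, but exempt a named bot first. The pattern lists and the function name `is_denied` are hypothetical and for illustration only; they are not dstiles' actual rules.

```python
import re

# Hypothetical pattern lists, for illustration only.
# Generic terms that many unwanted bots put in their UA string.
DENIED_TERMS = re.compile(r"spider|crawler", re.IGNORECASE)
# Bots that contain a denied term but should be excluded from the denial.
EXEMPT_BOTS = re.compile(r"Baiduspider", re.IGNORECASE)


def is_denied(user_agent: str) -> bool:
    """Return True if the UA contains a denied term and is not exempt."""
    # The exemption is checked first, so it overrides the generic denial.
    if EXEMPT_BOTS.search(user_agent):
        return False
    return bool(DENIED_TERMS.search(user_agent))
```

If Yandex turns out to use "crawler" in its UA, the same approach would need a second alternative in `EXEMPT_BOTS`. A UA string is trivially spoofed, so an exemption like this only makes sense alongside some other check.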
keyplyr
11:59 pm on Feb 15, 2012 (gmt 0)
IMO the white-list will (should) be different for every site. While I allow a large portion of dstiles' list, I don't allow any Nutch, Twitter parasites or anything from Asia.
lucy24
12:07 am on Feb 16, 2012 (gmt 0)
Don't recall if Yandex uses the "crawler" term, if so then more exclusion lines for that abused term.
I allowed Jeeves in the past; however, there was very little benefit for all their crawling.
Is Yahoo still crawling? I thought their spidering had been contracted out to another company. I haven't seen them since reactivation.
I don't allow anything in dstiles' list except the two major SEs; in fact, many of these are listed in my UA blacklist.
keyplyr
8:41 am on Feb 16, 2012 (gmt 0)
Is Yahoo still crawling?
Absolutely. Only their search index is supplied by Bing.
wilderness
9:01 am on Feb 16, 2012 (gmt 0)
yahoo ;)
lucy24
9:21 am on Feb 16, 2012 (gmt 0)
Absolutely. Only their search index is supplied by Bing.
Is there a vanilla YahooBot? During the month that I tracked all my robots, all I got was:
--a single visit from Yahoo! Slurp, which slurped up a single page with all images
--recurring visits from YahooCacheSystem, picking up the front page + favicon
Not much to index there. I have to go back to December to find it asking for anything else-- including, ahem, robots.txt ;) Way back in November it slurped up a different picture-intensive page. But that's it.
keyplyr
7:06 pm on Feb 16, 2012 (gmt 0)
Besides Slurp & YahooCacheSystem there are more than a few covert crawls, and it's anyone's guess what they're up to, but IMO the important thing is not to forget they're still in the game.
I see Slurp often. Like everything else, it depends on the site.
dstiles
8:57 pm on Feb 16, 2012 (gmt 0)
Just picked up on HuaweiSymantecSpider from another thread hereabouts. I originally had it allowed but put it in the block list as well (which pre-empts the Allow). So - shouldn't be in my list above.
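The ordering mentioned above, where a block-list match pre-empts an allow-list match, can be sketched roughly as follows. The list contents beyond HuaweiSymantecSpider, the function name `decide`, and the deny-by-default fallback are assumptions made for this example, not a description of the actual setup.

```python
# Hypothetical lists, for illustration only.
BLOCKED = ("HuaweiSymantecSpider",)
ALLOWED = ("HuaweiSymantecSpider", "Googlebot", "bingbot")


def decide(user_agent: str) -> str:
    """Return "deny" or "allow" for a UA string."""
    ua = user_agent.lower()
    # The block list is consulted first, so it pre-empts any allow entry.
    if any(name.lower() in ua for name in BLOCKED):
        return "deny"
    if any(name.lower() in ua for name in ALLOWED):
        return "allow"
    # Assumption: a white-list setup denies anything it does not recognise.
    return "deny"
```

With this ordering, a bot left in the allow list by mistake is still denied as long as it also appears in the block list.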
keyplyr - I almost never allow nutch but this one was an exception (the other was an old version of Yell). Cabot is the Amfibi SE's bot. To be fair, although Amfibi still exists I have no idea whether it a) still crawls and b) still includes nutch in the UA. The twitter and facebook bots are VERY limited: to start with, they have to be IP-based and NOTHING from AWS. They exist only because a few customers have requested them.
Edit by dstiles: Just noticed at the foot of Amfibi's bot page: "Powered by Nutch". Whether it still includes that in the UA isn't clear.
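The IP-based restriction on the twitter and facebook bots described above might look something like the sketch below. Every network range here is a placeholder taken from the reserved documentation blocks, not a real Twitter, Facebook or AWS range; the UA tokens and the function name `bot_allowed` are likewise assumptions for this example.

```python
from ipaddress import ip_address, ip_network

# Placeholder ranges (reserved documentation networks), NOT real bot ranges.
BOT_NETWORKS = {
    "twitterbot": [ip_network("192.0.2.0/24")],
    "facebookexternalhit": [ip_network("198.51.100.0/24")],
}
# Placeholder standing in for cloud-hosting ranges such as AWS.
CLOUD_NETWORKS = [ip_network("203.0.113.0/24")]


def bot_allowed(user_agent: str, remote_ip: str) -> bool:
    """Allow a social-media bot only from its own expected networks."""
    ip = ip_address(remote_ip)
    # Nothing from cloud hosting is accepted, whatever the UA claims.
    if any(ip in net for net in CLOUD_NETWORKS):
        return False
    ua = user_agent.lower()
    for token, networks in BOT_NETWORKS.items():
        if token in ua:
            # The UA claim must be backed by a matching source address.
            return any(ip in net for net in networks)
    # A UA that matches neither bot is outside the scope of this check.
    return False
```

In practice the ranges would have to be maintained by hand or pulled from whatever the operators publish, which is part of why such an allowance stays very limited.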