I currently block these, but I'm wondering if I'm being overly cautious. As far as I know, these are typically used by scrapers - is that still correct?
Thanks
wilderness
12:33 pm on Jan 7, 2015 (gmt 0)
The python UA is included in just about every example of htaccess denies.
There are a dozen or two common terms that are abused by harvesters and should be part of every black-list. 'spider' and 'crawler' are the most abused; other common terms are synonyms of 'download'.
FWIW, in more than 15 years I've seen a mere three references to 'Ruby', and they were from IPs that wouldn't have gained access anyway (one of which was Amazon).
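The black-list terms above can be wired up with mod_setenvif. A minimal sketch, assuming Apache 2.2-style access control; the `bad_bot` variable name is mine, and note that blanket-matching 'spider' and 'crawler' will also catch legitimate bots:

```apache
# Flag any request whose User-Agent contains one of the abused terms.
SetEnvIfNoCase User-Agent "python"   bad_bot
SetEnvIfNoCase User-Agent "ruby"     bad_bot
SetEnvIfNoCase User-Agent "spider"   bad_bot
SetEnvIfNoCase User-Agent "crawler"  bad_bot
SetEnvIfNoCase User-Agent "download" bad_bot

# Deny flagged requests, allow everything else.
Order Allow,Deny
Allow from all
Deny from env=bad_bot
```

On Apache 2.4 the last three lines would instead be a `Require` block (`Require not env bad_bot` inside `<RequireAll>`), but the SetEnvIfNoCase lines are the same.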
roshaoar
12:42 pm on Jan 7, 2015 (gmt 0)
Every time I tweet out a link, I get visited by a Ruby UA hitting the graphics. Always Amazon AWS IPs (many different ones).
topr8
1:00 pm on Jan 7, 2015 (gmt 0)
Personally I block all Amazon AWS, ditto Python UAs.
I'd not seen Ruby before, but I'd block that too - I can't see it being of any help.
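For blocking AWS wholesale, hard-coding address ranges is fragile, since Amazon adds ranges constantly; they do publish the current list as ip-ranges.json on their site, which is the thing to regenerate rules from. A hedged sketch of what the resulting htaccess block looks like - the CIDRs below are illustrative examples only, not a current or complete list:

```apache
# Illustrative only -- regenerate these entries from Amazon's
# published ip-ranges.json; these CIDRs are examples, not a
# current AWS list.
Order Allow,Deny
Allow from all
Deny from 52.0.0.0/10
Deny from 54.64.0.0/11
```

Bear in mind a blanket AWS block also denies any legitimate service that happens to be hosted there.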
roshaoar
1:17 pm on Jan 7, 2015 (gmt 0)
Amazon AWS is just weird. Almost every day, on a 25-hour cycle, one of my sites gets 4x as much traffic from fake-Googlebot "gocrawl" crawlers on AWS as from everything else put together. Hundreds and hundreds of IPs. I block 'em, they keep coming. They fetch robots.txt first but don't take the blindest bit of notice of it.
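Fake Googlebots can, in principle, be caught with a reverse-DNS check, since genuine Googlebot requests resolve to hostnames under googlebot.com (Google's own recommendation is a forward-confirmed reverse lookup). A hedged sketch in 2.2-era mod_setenvif syntax - it assumes HostnameLookups On, which has a real performance cost, and the `fake_googlebot` variable name is mine:

```apache
# Flag anything claiming to be Googlebot, then un-flag it if its
# reverse DNS really is under googlebot.com. Requires
# HostnameLookups On so that Remote_Host is populated.
SetEnvIfNoCase User-Agent "Googlebot" fake_googlebot
SetEnvIfNoCase Remote_Host "\.googlebot\.com$" !fake_googlebot

Order Allow,Deny
Allow from all
Deny from env=fake_googlebot
```

This only checks the reverse lookup; a fully forward-confirmed check (resolving the hostname back to the original IP) needs something outside plain htaccess, e.g. a log-processing script.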