| 5:11 am on Nov 26, 2013 (gmt 0)|
How about sharing the IP ranges involved so they can be blocked in advance?
The fact that a security company <chuckles> was involved makes me think it's compromised machines doing this.
| 6:59 am on Nov 26, 2013 (gmt 0)|
Well, that's my point. Except for a couple of (new to me) security-company ranges, the rest are well-known server farms, colos, or data centers already identified here in the forum. It would be redundant for me to list them, not to mention a PITA to cut 'n' paste 150 scattered IPs from yesterday's logs.
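For anyone facing the same chore, here's a rough Python sketch for pulling the unique IPs out of a log instead of pasting them by hand. It assumes a common/combined log format where the client IP is the first field; the filename is just a placeholder:

from collections import Counter

hits = Counter()
with open("access.log") as log:       # placeholder filename
    for line in log:
        parts = line.split(None, 1)   # client IP is the first field
        if parts:
            hits[parts[0]] += 1

# one IP per line, busiest first - easy to feed into a blocklist
for ip, count in hits.most_common():
    print(ip, count)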
Compromised machines, or distributed software? For example: Genieo or 80legs?
| 12:36 pm on Nov 26, 2013 (gmt 0)|
CrawlWall will fix this for us; I applied to beta-test it last week.
| 12:41 pm on Nov 26, 2013 (gmt 0)|
Could it be that Hidemyass service is being used for scraping? I spent a few weeks building a datacentre map of the net and identifying IP ranges as part of a project that never materialised. That scraper profile is very similar to a botnet using one of those privacy services.
| 5:43 pm on Nov 26, 2013 (gmt 0)|
@jmcc Possibly, but all the IPs were various servers, not Hidemyass ranges.
| 8:13 am on Nov 28, 2013 (gmt 0)|
@Keyplyr Apart from the obvious /24 or greater ranges, that service also uses small subnets in US data centres, as well as one ISP in Morocco and a Swiss VPN provider.
| 8:57 am on Nov 28, 2013 (gmt 0)|
@jmcc So Hidemyass can crawl a 150-page site in approx 40 seconds, changing IP address on every request (150 unique IP addresses, almost all from different server farms), spoofing a different UA each time, and requesting only the HTML?
I thought it was just an anonymizing proxy service. So it's a scraping tool as well? I didn't see that when I read their home page.
Most botnet activity I've seen focuses on the same several hacks or vulnerability-exploit attempts coming from different infected machines. I've not seen a coordinated crawl from 150 different machines, each knowing which unique web page to request from the server, all in sequence.
I'm still going with the theory that this is a site-wide scraping tool, yet to be identified.
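To illustrate the signature I mean, here's a rough Python sketch, not tested against real traffic: the thresholds and the (timestamp, ip, ua) input shape are assumptions. It flags a burst of requests where nearly every hit arrives from a different IP and a different UA:

from datetime import timedelta

def looks_like_distributed_scrape(requests, window=timedelta(seconds=60),
                                  min_hits=100, uniqueness=0.9):
    """requests: list of (timestamp, ip, ua) tuples, sorted by timestamp."""
    if len(requests) < min_hits:
        return False
    if requests[-1][0] - requests[0][0] > window:
        return False   # too slow to be the kind of burst described above
    ips = {ip for _, ip, _ in requests}
    uas = {ua for _, _, ua in requests}
    # a normal crawler reuses one IP and one UA; this pattern is the opposite
    return (len(ips) >= uniqueness * len(requests) and
            len(uas) >= uniqueness * len(requests))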
| 9:07 am on Nov 28, 2013 (gmt 0)|
Not sure about the headers; I wasn't checking them when someone using that service tried to download a 450-million-page site. The UAs and the IPs were changing. The speed and a few other characteristics looked off, so I grepped the IPs and checked them against the database here. Some botnets tend to be slow and stealthy, but this was a very aggressive scraper. It might have been just one bad client who decided to misuse the service.
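The checking step is easy to automate, by the way. A minimal Python sketch, assuming the known server-farm ranges are kept one CIDR per line in a text file (both filenames here are placeholders):

import ipaddress

with open("known_ranges.txt") as f:
    nets = [ipaddress.ip_network(line.strip())
            for line in f if line.strip()]

with open("scraper_ips.txt") as f:
    for line in f:
        line = line.strip()
        if not line:
            continue
        addr = ipaddress.ip_address(line)
        owner = next((net for net in nets if addr in net), None)
        # anything "unknown" is a range not yet in the database
        print(addr, "->", owner if owner else "unknown")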