Msg#: 4625966 posted 3:27 am on Nov 26, 2013 (gmt 0)
Had a 150-page site scraped (HTML only) at 3 to 4 pages per second, by a tool using a unique IP and a unique UA each time: over 50 IP ranges (150 unique IP addresses) and approx 50 different, normal-looking spoofed UAs. All headers looked normal, with no identifying characteristics.
Since this did not set off any warnings on my end, it was only caught by a manual log review. All participating IP ranges were/are server farms, many of them smaller ones previously unknown to me (now blocked), plus at least 2 web-security companies.
Heads-up, no doubt it is coming your way. Wish I could give a better warning. Maybe one of you can find something specific we can block.
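Since per-IP rate limits never fire when every request arrives from a fresh address, one thing that could have caught this is the *aggregate* request rate combined with IP diversity. A minimal sketch of that idea, assuming the access log has already been parsed into (unix_timestamp, ip, path) tuples; the thresholds and helper name are placeholders, not a tested rule set:

```python
from collections import deque

# Sketch: flag a distributed crawl by watching the aggregate request
# rate and the number of distinct IPs in a sliding window, since a
# scraper rotating through 150 addresses stays under any per-IP limit.

def detect_distributed_crawl(events, window=10, min_requests=20, min_unique_ips=15):
    """Yield (timestamp, n_requests, n_ips) whenever, within `window`
    seconds, the site saw at least `min_requests` requests spread over
    at least `min_unique_ips` distinct addresses."""
    recent = deque()  # (ts, ip) pairs inside the current window
    for ts, ip, path in events:
        recent.append((ts, ip))
        # drop events that have slid out of the window
        while recent and recent[0][0] < ts - window:
            recent.popleft()
        ips = {i for _, i in recent}
        if len(recent) >= min_requests and len(ips) >= min_unique_ips:
            yield ts, len(recent), len(ips)

# Example: 30 requests over 10 seconds, each from a different IP,
# which mimics the 3-4 pages/second, one-IP-per-request pattern.
events = [(i // 3, "10.0.%d.%d" % (i // 250, i % 250), "/page%d.html" % i)
          for i in range(30)]
hits = list(detect_distributed_crawl(events))
```

The point of the design is that no single IP here would trip a per-IP threshold; only the combined view does.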
Msg#: 4625966 posted 6:59 am on Nov 26, 2013 (gmt 0)
Well, that's my point. Except for a couple of (new to me) security company ranges, the rest are well-known server farms, colos or data centers already identified here in the forum. It would be redundant for me to list them, not to mention a PITA to cut'n'paste 150 scattered IPs from yesterday's logs.
Compromised machines or distributed software, for example: Genieo or 80legs?
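Rather than cut'n'pasting 150 scattered IPs by hand, they can be pulled out of the raw log and collapsed into blocks for easier blocking. A rough sketch, assuming a combined-format access log where the client address is the first field; the sample lines use documentation addresses, not the actual scraper's:

```python
import re
from collections import Counter

# Sketch: extract client IPs from combined-format log lines and
# collapse them to /24 prefixes so related addresses group together.

IP_RE = re.compile(r"^(\d{1,3}(?:\.\d{1,3}){3})\s")

def ip_blocks(log_lines):
    """Return a Counter of /24 prefixes seen in the log lines."""
    blocks = Counter()
    for line in log_lines:
        m = IP_RE.match(line)
        if m:
            prefix = m.group(1).rsplit(".", 1)[0]  # drop the last octet
            blocks[prefix + ".0/24"] += 1
    return blocks

sample = [
    '203.0.113.5 - - [26/Nov/2013] "GET /page1.html HTTP/1.1" 200',
    '203.0.113.77 - - [26/Nov/2013] "GET /page2.html HTTP/1.1" 200',
    '198.51.100.9 - - [26/Nov/2013] "GET /page3.html HTTP/1.1" 200',
]
blocks = ip_blocks(sample)
```

Sorting the resulting Counter by count would surface which ranges did the most fetching.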
Msg#: 4625966 posted 12:41 pm on Nov 26, 2013 (gmt 0)
Could it be that Hidemyass service being used for scraping? Spent a few weeks building a datacentre map of the net and identifying IP ranges as part of a project that never materialised. That scraper profile is very similar to a botnet using one of those privacy services.
Msg#: 4625966 posted 8:57 am on Nov 28, 2013 (gmt 0)
@jmcc So Hidemyass can crawl a 150-page site in approx 40 seconds, changing IP addresses each time (150 unique IP addresses, almost all different server farms), as well as spoofing a different UA each time, only requesting HTML?
I thought it was just a hidden proxy service. So it's a scraping tool as well? I didn't see that when I read their home page.
Most botnet activity I've seen focuses on the same several hacks or vulnerability exploit attempts, coming from different infected machines. I've not seen a coordinated crawl from 150 different machines, each knowing which unique page to request from a server, all in sequence.
I'm still going with the theory that this is a site-wide scraping tool, yet to be identified.
Msg#: 4625966 posted 9:07 am on Nov 28, 2013 (gmt 0)
Not sure about the headers, as I wasn't checking those when someone using that service tried to download a 450 million page site. The UAs and the IPs were changing. The speed and a few other characteristics looked off, so I grepped the IPs and checked them against the database here. Some botnets tend to be slow and try to be stealthy, but this was a very aggressive scraper. It might have been just one bad client who decided to misuse the service.
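The "grep the IPs and check them against the database" step can be sketched with the stdlib `ipaddress` module. The CIDR blocks below are documentation-range placeholders standing in for a real datacenter-range database like the one kept here:

```python
import ipaddress

# Sketch: classify logged addresses against known datacenter CIDR
# blocks. DATACENTER_BLOCKS is a placeholder list, not real
# datacenter assignments.

DATACENTER_BLOCKS = [ipaddress.ip_network(n) for n in
                     ("203.0.113.0/24", "198.51.100.0/24")]

def classify(ips):
    """Split addresses into (datacenter_hits, unknown)."""
    hits, unknown = [], []
    for raw in ips:
        addr = ipaddress.ip_address(raw)
        if any(addr in net for net in DATACENTER_BLOCKS):
            hits.append(raw)
        else:
            unknown.append(raw)
    return hits, unknown

hits, unknown = classify(["203.0.113.45", "192.0.2.10"])
```

A high proportion of datacenter hits among a site's non-search-engine traffic is the kind of signal that separates this pattern from ordinary visitors.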