
Forum Moderators: Ocean10000 & incrediBILL


New Scraping Tool

     
3:27 am on Nov 26, 2013 (gmt 0)

keyplyr (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month)





Had a 150-page site scraped (HTML only) at 3 to 4 pages per second by a tool using a unique IP and a unique UA each time: over 50 IP ranges (150 unique IP addresses) and approx. 50 different, normal-looking spoofed UAs. All headers looked normal, with no identifying characteristics.

Since this did not set off any warnings on my end, it was only caught by manually viewing the logs. All participating IP ranges were/are server farms, many of them smaller ones unknown to me (now blocked), plus at least 2 web-security companies.

Heads-up, no doubt it is coming your way. Wish I could give a better warning. Maybe one of you can find something specific we can block.
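
A rough sketch of how this pattern (a burst of page-only requests where nearly every request comes from a different IP) could be looked for in logs after the fact. This is not a tool from this thread. The `Hit` record, the `is_page` heuristic and the thresholds (`window`, `min_pages`, `min_ip_ratio`) are all assumptions for illustration and would need tuning to a site's normal traffic; log parsing is left out.

```python
from collections import deque
from datetime import datetime, timedelta
from typing import Iterable, Iterator, NamedTuple, Tuple


class Hit(NamedTuple):
    """One parsed access-log entry (hypothetical shape)."""
    when: datetime
    ip: str
    path: str
    user_agent: str


def is_page(path: str) -> bool:
    """Assumption: extensionless paths and .html/.htm count as pages."""
    name = path.split("?", 1)[0].rsplit("/", 1)[-1]
    return "." not in name or name.lower().endswith((".html", ".htm"))


def find_distributed_crawls(
    hits: Iterable[Hit],
    window: timedelta = timedelta(seconds=60),
    min_pages: int = 100,
    min_ip_ratio: float = 0.9,
) -> Iterator[Tuple[datetime, datetime, int, int]]:
    """Yield (start, end, page_requests, distinct_ips) for suspicious bursts.

    `hits` must be sorted by time. A burst is flagged when at least
    `min_pages` page requests fall inside `window` and the share of
    distinct IPs among them is at least `min_ip_ratio`.
    """
    recent = deque()
    for hit in hits:
        if not is_page(hit.path):
            continue  # images, CSS, JS etc. are ignored
        recent.append(hit)
        # Drop page requests that have fallen out of the sliding window.
        while hit.when - recent[0].when > window:
            recent.popleft()
        distinct_ips = {h.ip for h in recent}
        if len(recent) >= min_pages and len(distinct_ips) / len(recent) >= min_ip_ratio:
            yield recent[0].when, hit.when, len(recent), len(distinct_ips)
            recent.clear()  # start fresh so one burst is not reported repeatedly
```

On a small site, 100 page requests in a minute from 90+ different addresses is unusual. On a busy site the same numbers may be ordinary traffic, so the thresholds above are only a starting point.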
5:11 am on Nov 26, 2013 (gmt 0)

incredibill (WebmasterWorld Administrator, Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month)



How about sharing the IP ranges involved so they can be blocked in advance?

The fact that a security company <chuckles> was involved makes me think it's compromised machines doing this.
6:59 am on Nov 26, 2013 (gmt 0)

keyplyr (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month)



Well, that's my point. Except for a couple of (new for me) security company ranges, the rest are well-known server farms, colos or data centers already identified here in the forum. It would be redundant for me to list them, not to mention a PITA to cut'n'paste 150 scattered IPs from yesterday's logs.

Compromised machines or distributed software, for example genieo or 80legs?
12:36 pm on Nov 26, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



CrawlWall will fix this for us. I applied to beta-test it last week.
12:41 pm on Nov 26, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Could it be that the Hidemyass service is being used for scraping? I spent a few weeks building a datacentre map of the net and identifying IP ranges as part of a project that never materialised. That scraper profile is very similar to a botnet using one of those privacy services.

Regards...jmcc
5:43 pm on Nov 26, 2013 (gmt 0)

keyplyr (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month)




@jmcc Possibly, but all the IPs were various servers, not Hidemyass ranges.
8:13 am on Nov 28, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



@Keyplyr Apart from the obvious /24 or greater ranges, that service also uses small subnets in US data centres, plus one ISP in Morocco and a Swiss VPN provider.

Regards...jmcc
8:57 am on Nov 28, 2013 (gmt 0)

keyplyr (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month)



@jmcc So Hidemyass can crawl a 150-page site in approx. 40 seconds, changing IP addresses each time (150 unique IP addresses, almost all from different server farms) as well as spoofing a different UA each time, and only requesting HTML?

I thought it was just a hidden proxy service. So it's a scraping tool as well? I didn't see that when I read their home page.

Most botnet activity I've seen focuses on the same several hacks or vulnerability exploit attempts, from different infected machines. I've not seen a coordinated crawl from 150 different machines, each knowing which unique web page to request from the server, all in sequence.

I'm still going with the theory that this is a site-wide scraping tool, yet to be identified.
9:07 am on Nov 28, 2013 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Not sure about headers, as I wasn't checking them when someone using that service tried to download a 450-million-page site. The UAs and the IPs were changing. The speed and a few other characteristics looked off, so I grepped the IPs and checked them against the database here. Some botnets tend to be slow and try to be stealthy, but this was a very aggressive scraper. It might have been just one bad client who decided to misuse the service.

Regards...jmcc
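
The "grep the IPs and check them against a database" step can be sketched roughly as below, using Python's standard `ipaddress` module. The database mentioned in the post above is not available here, so `DATACENTRE_RANGES` is a made-up stand-in filled with documentation-only address blocks (RFC 5737), and the labels and helper names are hypothetical.

```python
import ipaddress
from collections import Counter
from typing import Dict, Iterable, List, Optional, Tuple

# Placeholder data: RFC 5737 documentation ranges, NOT real hosting ranges.
DATACENTRE_RANGES: Dict[str, str] = {
    "example-colo-a": "192.0.2.0/24",
    "example-colo-b": "198.51.100.0/25",  # a "small subnet" style entry
}

Lookup = List[Tuple[ipaddress.IPv4Network, str]]


def build_lookup(ranges: Dict[str, str]) -> Lookup:
    """Parse the CIDR strings once so that lookups are cheap."""
    return [(ipaddress.ip_network(cidr), label) for label, cidr in ranges.items()]


def classify(ip: str, lookup: Lookup) -> Optional[str]:
    """Return the label of the first range containing `ip`, or None."""
    addr = ipaddress.ip_address(ip)
    for network, label in lookup:
        if addr in network:
            return label
    return None


def summarise(ips: Iterable[str], lookup: Lookup) -> Counter:
    """Count how many of the logged IPs fall into each known range."""
    return Counter(classify(ip, lookup) or "unlisted" for ip in ips)
```

Feeding the IPs from a suspicious burst through `summarise` shows at a glance whether they cluster in known hosting ranges or are mostly "unlisted", which would point more towards residential or compromised machines. A linear scan is fine for a few hundred ranges; a much larger map would want a sorted or tree-based lookup.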
 
