Forum Moderators: Ocean10000 & incrediBILL & keyplyr

New Scraping Tool

     
3:27 am on Nov 26, 2013 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:6674
votes: 131




Had a 150-page site scraped (HTML only) at 3 to 4 pages per second by a tool using a unique IP and a unique UA each time: over 50 IP ranges (150 unique IP addresses) and approx 50 different, normal-looking spoofed UAs. All headers looked normal, with no identifying characteristics.

Since this did not set off any warnings on my end, it was only caught by manually viewing the logs. All participating IP ranges were/are server farms, many of them smaller ones unknown to me (now blocked), plus at least 2 web-security companies.

Heads-up, no doubt it is coming your way. Wish I could give a better warning. Maybe one of you can find something specific we can block.
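
For anyone who wants to check their own logs for the same pattern, here is a rough sketch of one way to flag it. It assumes the access log has already been parsed into (timestamp, ip, path) tuples. The function names, the asset suffix list and the thresholds are all made up for illustration and would need tuning per site. The idea is simply that a real visitor's browser fetches CSS, scripts and images, so a burst of page requests from IPs that never fetch any asset is worth a look.

```python
from collections import deque
from datetime import datetime, timedelta

# Hypothetical list of suffixes treated as page assets; adjust for your site.
ASSET_SUFFIXES = (".css", ".js", ".png", ".jpg", ".jpeg", ".gif", ".ico", ".svg")


def is_page(path):
    """True if the request path looks like a page rather than an asset."""
    return not path.lower().split("?", 1)[0].endswith(ASSET_SUFFIXES)


def find_html_only_bursts(hits, window_seconds=30, min_hits=60):
    """Flag bursts of page requests coming from IPs that never fetched an asset.

    hits: iterable of (timestamp: datetime, ip: str, path: str).
    Returns a list of (start, end, distinct_ip_count), one per flagged burst.
    The thresholds are assumptions, not tested recommendations.
    """
    hits = sorted(hits)
    # IPs that fetched at least one asset behave like ordinary browsers.
    asset_ips = {ip for _, ip, path in hits if not is_page(path)}

    span = timedelta(seconds=window_seconds)
    window = deque()
    bursts = []
    for ts, ip, _path in hits:
        if ip in asset_ips:
            continue
        window.append((ts, ip))
        # Drop entries that have fallen out of the time window.
        while ts - window[0][0] > span:
            window.popleft()
        if len(window) >= min_hits:
            distinct = len({w_ip for _, w_ip in window})
            bursts.append((window[0][0], ts, distinct))
            window.clear()
    return bursts
```

A high distinct-IP count in a flagged burst would match what I saw: each request from a different address. This does nothing against a scraper that also pulls assets, so treat it as a starting point only.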
5:11 am on Nov 26, 2013 (gmt 0)

Administrator from US 

WebmasterWorld Administrator incredibill is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Jan 25, 2005
posts:14650
votes: 94


How about sharing the IP ranges involved so they can be blocked in advance?

The fact that a security company <chuckles> was involved makes me think it's compromised machines doing this.
6:59 am on Nov 26, 2013 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:6674
votes: 131


Well, that's my point. Except for a couple of (new for me) security company ranges, the rest are well-known server farms, colos or data centers already identified here in the forum. It would be redundant for me to list them, not to mention a PITA to cut'n paste 150 scattered IPs from yesterday's logs.

Compromised machines or distributed software, for example Genieo or 80legs?
12:36 pm on Nov 26, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 13, 2003
posts:700
votes: 0


CrawlWall will fix this for us. I applied to beta-test it last week.
12:41 pm on Nov 26, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts: 2510
votes: 44


Could it be that the Hidemyass service is being used for scraping? I spent a few weeks building a datacentre map of the net and identifying IP ranges as part of a project that never materialised. That scraper profile is very similar to a botnet using one of those privacy services.

Regards...jmcc
5:43 pm on Nov 26, 2013 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:6674
votes: 131



@jmcc Possibly, but all the IPs were various servers, not Hidemyass ranges.
8:13 am on Nov 28, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts: 2510
votes: 44


@Keyplyr Apart from the obvious /24 or greater ranges, that service also uses small subnets in US data centres, plus one ISP in Morocco and a Swiss VPN provider.

Regards...jmcc
8:57 am on Nov 28, 2013 (gmt 0)

Moderator This Forum from US 

WebmasterWorld Administrator keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:6674
votes: 131


@jmcc So Hidemyass can crawl a 150-page site in approx 40 seconds, changing IP addresses each time (150 unique IP addresses, almost all in different server farms) as well as spoofing a different UA each time, only requesting HTML?

I thought it was just a hidden proxy service. So it's a scraping tool as well? I didn't see that when I read their home page.

Most botnet activity I've seen focuses on the same several hacks or vulnerability exploit attempts, from different infected machines. I've not seen a coordinated crawl from 150 different machines all knowing what unique web page to request from a file server, all in sequential time.
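
To put numbers on what "coordinated" means here, a small sketch like the one below could summarise a suspect batch of requests. It assumes the requests have already been pulled from the log as (ip, path) pairs, and the function name and output keys are hypothetical. A crawl like the one I saw would show roughly as many distinct IPs as requests, nearly every IP used once, and no page requested twice.

```python
from collections import Counter


def crawl_signature(hits):
    """Summarise how requests are spread across IPs and pages.

    hits: list of (ip, path) pairs taken from a suspect time window.
    The name and the returned keys are illustrative only.
    """
    ip_counts = Counter(ip for ip, _ in hits)
    path_counts = Counter(path for _, path in hits)
    return {
        "requests": len(hits),
        "distinct_ips": len(ip_counts),
        "distinct_paths": len(path_counts),
        # IPs that made exactly one request in the batch.
        "single_use_ips": sum(1 for c in ip_counts.values() if c == 1),
        # Pages that were fetched more than once.
        "repeated_paths": sum(1 for c in path_counts.values() if c > 1),
    }
```

Typical exploit probing from a botnet would look the opposite way round: many IPs hitting the same handful of paths, so "repeated_paths" would be high and "distinct_paths" low.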

I'm still going with the theory that this is a site-wide scraping tool, yet to be identified.
9:07 am on Nov 28, 2013 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 30, 2002
posts: 2510
votes: 44


Not sure about headers, as I wasn't checking them when someone using that service tried to download a 450 million page site. The UAs and the IPs were changing. The speed and a few other characteristics looked off, so I grepped the IPs and checked them against the database here. Some botnets tend to be slow and try to be stealthy, but this was a very aggressive scraper. It might have been just one bad client who decided to misuse the service.
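
The checking step can be sketched roughly as below, using Python's standard ipaddress module. This is not the database or tooling mentioned above, just an illustration: the function names are hypothetical, and the CIDR ranges in the usage note are reserved documentation ranges standing in for a real list of data-centre networks.

```python
import ipaddress


def load_networks(cidrs):
    """Parse CIDR strings into network objects (host bits are tolerated)."""
    return [ipaddress.ip_network(c, strict=False) for c in cidrs]


def match_ips(ips, networks):
    """Map each IP string to the first known network containing it, or None."""
    result = {}
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        result[ip] = next((str(net) for net in networks if addr in net), None)
    return result
```

For example, match_ips(["192.0.2.15", "203.0.113.9"], load_networks(["192.0.2.0/24"])) maps the first address to "192.0.2.0/24" and the second to None. Anything that comes back None is either an unlisted range or an ordinary access line and needs a manual look.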

Regards...jmcc