
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

    
New Scraping Tool
keyplyr




msg:4625968
 3:27 am on Nov 26, 2013 (gmt 0)



Had a 150-page site scraped (HTML only) at 3 to 4 pages per second by a tool using a unique IP and a unique UA each time: over 50 IP ranges (150 unique IP addresses) and approx. 50 different, normal-looking spoofed UAs. All headers looked normal, with no identifying characteristics.

Since this did not set off any warnings on my end, it was only caught by manually viewing the logs. All participating IP ranges were/are server farms, many of them smaller ones unknown to me (now blocked), plus at least 2 web-security companies.

Heads-up, no doubt it is coming your way. Wish I could give a better warning. Maybe one of you can find something specific we can block.
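
The pattern described in this post (many different IPs, each making only one or two HTML requests, all inside a short time span) is the kind of thing per-IP rate limits miss. Below is a minimal sketch of one way to look for it, assuming the access log has already been parsed into time-sorted `(timestamp, ip)` tuples for HTML requests only. The function name `first_burst` and the thresholds (60-second window, 30 low-volume IPs, at most 2 hits per IP) are illustrative assumptions, not values from the thread, and would need tuning against normal traffic.

```python
from collections import Counter, deque
from datetime import datetime, timedelta


def first_burst(hits, window_seconds=60, min_unique_ips=30, max_hits_per_ip=2):
    """Return (window_start, window_end, low_volume_ip_count) for the first
    sliding window that contains at least `min_unique_ips` addresses which
    each made no more than `max_hits_per_ip` requests, or None if no window
    qualifies. `hits` must be (timestamp, ip) tuples sorted by timestamp.
    All thresholds are hypothetical defaults."""
    span = timedelta(seconds=window_seconds)
    window = deque()      # hits currently inside the sliding window
    per_ip = Counter()    # request count per IP inside the window

    for ts, ip in hits:
        window.append((ts, ip))
        per_ip[ip] += 1

        # Drop hits that have fallen out of the window.
        while ts - window[0][0] > span:
            _, old_ip = window.popleft()
            per_ip[old_ip] -= 1
            if per_ip[old_ip] == 0:
                del per_ip[old_ip]

        # Count IPs that look like "one request and gone".
        low_volume = sum(1 for n in per_ip.values() if n <= max_hits_per_ip)
        if low_volume >= min_unique_ips:
            return window[0][0], ts, low_volume

    return None
```

A real site with many casual visitors will also show lots of single-hit IPs, so in practice this would probably need to be combined with other signals, for example restricting the count to addresses in known hosting ranges.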

 

incrediBILL




msg:4625991
 5:11 am on Nov 26, 2013 (gmt 0)

How about sharing the IP ranges involved so it can be blocked in advance?

The fact that a security company <chuckles> was involved makes me think it's compromised machines doing this.

keyplyr




msg:4626002
 6:59 am on Nov 26, 2013 (gmt 0)

Well, that's my point. Except for a couple of (new for me) security company ranges, the rest are well-known server farms, colos or data centers already identified here in the forum. It would be redundant for me to list them, not to mention a PITA to cut'n'paste 150 scattered IPs from yesterday's logs.

Compromised machines or distributed software, for example genieo or 80legs?
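
For anyone who does want to share scattered addresses without pasting them one by one, collapsing them into ranges takes only a few lines. This is a rough sketch using Python's standard `ipaddress` module; it assumes IPv4 addresses and groups them by /24, which is an arbitrary choice for illustration (real allocations are often larger or smaller). The function name `summarize_by_slash24` and the sample addresses are made up.

```python
import ipaddress
from collections import Counter


def summarize_by_slash24(ips):
    """Group unique IPv4 addresses by their enclosing /24 network.
    Returns (network, unique_address_count) pairs, busiest network first."""
    counts = Counter()
    for ip in set(ips):  # de-duplicate repeated addresses
        # strict=False lets us pass a host address and get its network back
        net = ipaddress.ip_network(f"{ip}/24", strict=False)
        counts[str(net)] += 1
    return sorted(counts.items(), key=lambda item: (-item[1], item[0]))


# Example with documentation-reserved addresses (not from any real log):
for network, n in summarize_by_slash24(
    ["198.51.100.7", "198.51.100.99", "203.0.113.5", "198.51.100.7"]
):
    print(network, n)
```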

Angonasec




msg:4626052
 12:36 pm on Nov 26, 2013 (gmt 0)

CrawlWall will fix this for us; I applied to beta-test it last week.

jmccormac




msg:4626053
 12:41 pm on Nov 26, 2013 (gmt 0)

Could it be that the Hidemyass service is being used for scraping? I spent a few weeks building a datacentre map of the net and identifying IP ranges as part of a project that never materialised. That scraper profile is very similar to a botnet using one of those privacy services.

Regards...jmcc

keyplyr




msg:4626127
 5:43 pm on Nov 26, 2013 (gmt 0)


@jmcc Possibly, but all the IPs were various servers, not Hidemyass ranges.

jmccormac




msg:4626515
 8:13 am on Nov 28, 2013 (gmt 0)

@Keyplyr Apart from the obvious /24 or greater ranges, that service also uses small subnets in US data centres, one ISP in Morocco, and a Swiss VPN provider.

Regards...jmcc

keyplyr




msg:4626520
 8:57 am on Nov 28, 2013 (gmt 0)

@jmcc So Hidemyass can crawl a 150-page site in approx. 40 seconds, changing IP addresses each time (150 unique IP addresses, almost all from different server farms) as well as spoofing a different UA each time, only requesting HTML?

I thought it was just a hidden proxy service. So it's a scraping tool as well? I didn't see that when I read their home page.

Most botnet activity I've seen focuses on the same several hacks or vulnerability exploit attempts, from different infected machines. I've not seen a coordinated crawl from 150 different machines all knowing what unique web page to request from a file server, all in sequential time.

I'm still going with the theory that this is a site-wide scraping tool, yet to be identified.

jmccormac




msg:4626522
 9:07 am on Nov 28, 2013 (gmt 0)

Not sure about headers, as I wasn't checking them when someone using that service tried to download a 450 million page site. The UAs and the IPs were changing. The speed and a few other characteristics looked off, so I grepped the IPs and checked them against the database here. Some botnets tend to be slow and try to be stealthy, but this was a very aggressive scraper. It might have been just one bad client who decided to misuse the service.

Regards...jmcc
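
The "grep the IPs and check them against a database" step can be approximated with the standard `ipaddress` module. This is only a sketch under assumptions: the database mentioned above is not public, so `known_ranges` here is a hypothetical list of CIDR strings that you maintain yourself, and `match_known_ranges` is an illustrative name.

```python
import ipaddress


def match_known_ranges(ips, known_ranges):
    """Map each IP string to the first known CIDR range containing it,
    or to None if it falls outside every listed range."""
    networks = [ipaddress.ip_network(cidr) for cidr in known_ranges]
    matches = {}
    for ip in ips:
        addr = ipaddress.ip_address(ip)
        # Linear scan; fine for a few thousand ranges, slow for very large lists.
        matches[ip] = next((str(n) for n in networks if addr in n), None)
    return matches


# Hypothetical ranges and addresses (documentation-reserved blocks):
known = ["198.51.100.0/24", "203.0.113.0/28"]
print(match_known_ranges(["198.51.100.20", "192.0.2.9"], known))
```

Small subnets such as a /28 are handled the same way as /24 or larger ranges, which matters if a service rents a handful of addresses inside someone else's data centre.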


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved