Crawlera

keyplyr

10:01 am on Nov 10, 2015 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



UA: Mozilla/5.0 (compatible; Crawlera/1.6.10; UID 54433)
Protocol: HTTP/1.1
Robots.txt: No
Host: proxies

From crawlera.com:
"Crawlera is a smart downloader designed specifically for web crawling and scraping. It allows you to crawl quickly and reliably, managing thousands of proxies internally, so you don’t have to.

"Crawlera routes requests through a pool of IPs, throttling access by introducing delays and discarding IPs from the pool when they get banned from certain domains, or have other problems."

It hit one of my sites, requesting the same file 30 times from approximately 30 IPs, mostly at various server farms. All requests got a 403. Hits were 3 seconds apart. The "UID" number in the UA string was unique for each request.
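
As a rough illustration of how the pattern reported above (one file, many IPs, a different UID number each time) could be spotted in logs, here is a minimal Python sketch. It assumes the log has already been parsed into (ip, path, user_agent) tuples. The function names, the "/page.html" path, the threshold, and the sample addresses (documentation ranges) are all hypothetical.

```python
import re
from collections import defaultdict

# Matches the per-request number seen in the UA, e.g. "UID 54433".
UID_NUMBER = re.compile(r"UID \d+")

def normalize_ua(user_agent):
    """Replace the changing UID number so all such UAs compare equal."""
    return UID_NUMBER.sub("UID N", user_agent)

def distributed_fetches(hits, min_ips=5):
    """Group hits by (path, normalized UA) and keep groups that were
    requested from at least min_ips distinct IP addresses.

    hits: iterable of (ip, path, user_agent) tuples (assumed format).
    """
    groups = defaultdict(set)
    for ip, path, user_agent in hits:
        groups[(path, normalize_ua(user_agent))].add(ip)
    return {key: ips for key, ips in groups.items() if len(ips) >= min_ips}

# Made-up sample data: 30 requests for one file, each from a different IP
# and with a different UID, plus one unrelated visitor.
hits = [
    (f"198.51.100.{i}", "/page.html",
     f"Mozilla/5.0 (compatible; Crawlera/1.6.10; UID {54000 + i})")
    for i in range(1, 31)
]
hits.append(("203.0.113.7", "/page.html", "Mozilla/5.0 (Windows NT 6.1)"))

flagged = distributed_fetches(hits)
```

Normalizing the UID before grouping is the design choice that matters here: without it, every request would look like a separate agent.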

lucy24

7:21 pm on Nov 10, 2015 (gmt 0)

:: detour to raw logs ::

The string "\bUID", case sensitive, at word boundary, does not occur in any legitimate human UA. (Without the word break, there are a handful, notably in recent MSIE.) So it's blockable that way.
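
A minimal sketch of that check in Python, for anyone who wants to test it against their own logs before writing a block rule. The function name and the second sample UA are made up for illustration; only the Crawlera UA comes from this thread.

```python
import re

# Case-sensitive "UID" that starts at a word boundary.
UID_PATTERN = re.compile(r"\bUID")

def has_uid_token(user_agent: str) -> bool:
    """Return True if 'UID' appears in the UA starting at a word boundary."""
    return UID_PATTERN.search(user_agent) is not None

crawlera_ua = "Mozilla/5.0 (compatible; Crawlera/1.6.10; UID 54433)"
# Hypothetical UA where "UID" only occurs mid-word, so \b does not match.
midword_ua = "Mozilla/5.0 (compatible; ExampleBrowser; GUID-style token)"

for ua in (crawlera_ua, midword_ua):
    print(has_uid_token(ua), ua)
```

Whether the pattern is safe on a given site still depends on that site's own traffic, so it is worth running it over real logs first.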

keyplyr

7:59 pm on Nov 10, 2015 (gmt 0)

This is a bad agent for webmasters. It's a blatant scraper, made to deal with defensive measures on our part. More on their homepage.

lucy24

9:28 pm on Nov 10, 2015 (gmt 0)

keyplyr wrote: "More on their homepage."

Considering that the actual word "scraping" occurs in the very first sentence of their prose, they don't leave a lot of room for ambiguity, do they? And it only goes downhill from there.

:: further detour to raw logs ::

If a universal "[Cc]rawler" block isn't practicable, there's also "Crawlera". (I had to make sure that there wasn't anything like "Crawleragent" or "Crawlerabout" buried in legitimate robots' UA strings.)
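
A small Python sketch of the two options, assuming the UA string is already available as text. The function name, the labels it returns, and the "ExampleCrawler" UA are hypothetical.

```python
import re

NARROW = re.compile(r"Crawlera")    # this agent only
BROAD = re.compile(r"[Cc]rawler")   # anything containing Crawler/crawler

def classify(user_agent: str) -> str:
    """Report which of the two patterns, if any, a UA would trip."""
    if NARROW.search(user_agent):
        return "crawlera"
    if BROAD.search(user_agent):
        return "other-crawler"
    return "no-match"

samples = [
    "Mozilla/5.0 (compatible; Crawlera/1.6.10; UID 54433)",
    "ExampleCrawler/1.0",            # made-up UA, caught only by the broad rule
    "Mozilla/5.0 (Windows NT 6.1)",
]
for ua in samples:
    print(classify(ua), "<-", ua)
```

The narrow pattern is a plain substring match, so it would also catch a name like "Crawleragent" if one ever turned up, which is why the log check mentioned above matters.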

Matter of fact, considering what they're willing to admit for public consumption, it's pretty astonishing that they even use an identifiable UA name. Most scrapers do have a UA-spoofing option; I suppose theirs is buried somewhere in the only-after-you've-paid area.

Ugh. I think they've incorporated another nasty. Their sitename doesn't show up in my browser history, and the browser's autocomplete doesn't suggest the name on a return visit. Poring over page source and linked script, I do find the strings "history" and "autocomplete", but as I only speak four words of JavaScript this doesn't get me far. I also find GA cross-links to two other domain names involving the "scrape" element. Dog Bites Man.

blend27

2:56 am on Nov 12, 2015 (gmt 0)

Quote: "discarding IPs from the pool when they get banned"

Just a thought,

Don't ban the IP right away. Instead, serve them a binary dump of your favorite picture in text format, with random characters in the dump replaced. See where that bleeds them. They would like that ;)
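
One way this could look, as a hedged Python sketch: base64 is only an assumed choice of "text format", and the function name, the replacement fraction, and the seed parameter are all hypothetical.

```python
import base64
import random
import string

def corrupted_text_dump(data: bytes, fraction: float = 0.05, seed=None) -> str:
    """Encode data as base64 text, then overwrite a fraction of the
    characters with random letters or digits."""
    rng = random.Random(seed)
    chars = list(base64.b64encode(data).decode("ascii"))
    alphabet = string.ascii_letters + string.digits
    count = int(len(chars) * fraction)
    # Pick distinct positions; a replacement may happen to equal the original.
    for index in rng.sample(range(len(chars)), count):
        chars[index] = rng.choice(alphabet)
    return "".join(chars)

# Stand-in bytes; a real image file would be read with open(path, "rb").
payload = corrupted_text_dump(bytes(range(256)), fraction=0.1, seed=1)
```

The output keeps the length and general look of a valid dump, so the damage only shows up once someone tries to decode it.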

blend27

3:09 am on Nov 12, 2015 (gmt 0)

Another thought:

Why can't we crowdsource a service that does the blocking of things like this?

keyplyr

3:32 am on Nov 12, 2015 (gmt 0)

Simple solution, don't play their game... just block all known server farms & only let humans through. This should already be the case for any webmaster but if not, this could be a convenient way to find those server ranges :)
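
For illustration, a minimal Python sketch of a range check using the standard ipaddress module. The two CIDR blocks are placeholder documentation ranges, not real server-farm allocations, and the names are hypothetical. A real list would come from your own research.

```python
import ipaddress

# Placeholder ranges for illustration only (reserved documentation blocks).
HOSTING_RANGES = [
    ipaddress.ip_network(cidr)
    for cidr in ("192.0.2.0/24", "198.51.100.0/24")
]

def is_server_farm(ip: str) -> bool:
    """Return True if the address falls inside any listed range."""
    address = ipaddress.ip_address(ip)
    return any(address in network for network in HOSTING_RANGES)

for ip in ("192.0.2.15", "203.0.113.9"):
    print(ip, "blocked" if is_server_farm(ip) else "allowed")
```

A linear scan is fine for a short list. A long list of ranges would probably want a faster lookup structure.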

blend27

3:38 am on Nov 12, 2015 (gmt 0)

easy peasy :)

keyplyr

3:47 am on Nov 12, 2015 (gmt 0)

Well, we all started as newbies without defenses, then we did the research & learned to do the work. Now we try to keep up.