I built my own anti-scraper system last winter. Got tired of scrapers hogging my bandwidth and stealing my pages.
It catches about 1-2 a day. Not surprisingly, a lot of them are apparently from reputable companies attempting to monitor their online reputation. I had one company contact me about setting up a custom RSS feed of user comments on our site. They said they typically use a program to monitor such information, but it wasn't working on our site for some reason. :-) I don't mind them monitoring if they do it nicely, but most just tear through our site with little regard for our server.
Timely topic as PubCon has a session on this very hot topic, again, but it's the last day so I hope people stick around and pay attention this time.
Funny thing is most people say they don't care about scrapers until it hits the fan and by then the damage is done.
|The emerging business of web scraping provides some of the raw material for a rapidly expanding data economy. Marketers spent $7.8 billion on online and offline data in 2009, according to the New York management consulting firm Winterberry Group LLC. Spending on data from online sources is set to more than double, to $840 million in 2012 from $410 million in 2009. |
That's why people scrape - money.
If you stop them from scraping they *may* have to share the wealth to get what they want.
Very simple reason why my sites have legit bots whitelisted so all else like "80legs" in their article get the bounce. NOARCHIVE is used to prevent scraping SE cache and internet archives are disabled. Plus a whole lot more.
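For anyone wondering what "legit bots whitelisted" means in practice, here's a rough Python sketch of the core check - the domain list and function names are mine, just for illustration. The point is you never trust the User-agent string alone; you verify the claimed bot by reverse DNS and then forward-confirm the hostname:

```python
import socket

# Illustrative whitelist: suffixes of the reverse-DNS hostnames used by
# major search engines. UA strings are trivially faked, so verify the IP.
TRUSTED_BOT_DOMAINS = (".googlebot.com", ".search.msn.com")

def is_whitelisted_bot(ip: str) -> bool:
    """Reverse-resolve the IP, then forward-resolve to confirm the match."""
    try:
        host = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False
    if not host.endswith(TRUSTED_BOT_DOMAINS):
        return False
    # Forward-confirm: the hostname must resolve back to the same IP,
    # otherwise anyone controlling their own reverse DNS could spoof it.
    try:
        return ip in socket.gethostbyname_ex(host)[2]
    except OSError:
        return False
```

Anything that claims to be Googlebot but fails this check gets the bounce along with the rest.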
Sorry, my job isn't to feed leeches, my resources aren't for being leeched, and if scrapers go broke tomorrow it won't be a day too soon.
Anyone know the User-agent or IP range for the Nielsen company scraper mentioned in the article?
It would be nice if the legal system were set up so you could sue, or at least easily shut down, sites that straight-up copy your website. The way it is now... well, it's like the Wild West if you're a small site with unique content.
Has there been ANY attempt within the open source movement to develop solutions to the bot-scraper plague?
Since the problem is well known and likely to keep growing it's a bit surprising that something akin to Akismet-for-blocking-site-ripper-bots hasn't emerged.
|Since the problem is well known and likely to keep growing it's a bit surprising that something akin to Akismet-for-blocking-site-ripper-bots hasn't emerged. |
I agree, but I think we'll have to wait until your average webmaster becomes aware of this issue, and that probably won't happen until it gets a lot more negative publicity.
All this data scraping is like gold mining, so it's inevitable that there will be money in selling both shovels (scraper software) and the means to keep other people from raiding your nuggets of information.
Would flood protection help slow down the scrapers? I know it won't stop them, but it may make scraping less productive.
Flood protection would only slow down the most amateurish of the scrapers. Some of them tend to be smarter than that and may even have randomised timing to make them look like a human user. What often gives them away is their sharp difference from human browsing patterns.
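To make that concrete, basic flood protection is just a per-IP sliding-window counter - something like this minimal Python sketch (limits and names are made up, tune for your own traffic). A scraper that randomises its delays will happily stay under the threshold, which is exactly why it only catches the amateurs:

```python
import time
from collections import defaultdict, deque

WINDOW = 60.0    # seconds
MAX_HITS = 30    # requests allowed per IP per window (illustrative)

_hits = defaultdict(deque)

def allow_request(ip, now=None):
    """Return True if this IP is still under the rate limit."""
    now = time.monotonic() if now is None else now
    q = _hits[ip]
    # Drop timestamps that have aged out of the window.
    while q and now - q[0] > WINDOW:
        q.popleft()
    if len(q) >= MAX_HITS:
        return False   # over the limit: throttle or block
    q.append(now)
    return True
```

A patient bot doing one page every few seconds never trips this, so it's a first filter, not a defence.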
This is not something I have studied; however, scraping seems to fall into two categories - Commercial Research and Adsense.
Google permits scraping because it doesn't care about intellectual property (except its own) and because they are now happy to be evil provided there's money in it and provided Joe Public doesn't catch on (that they are being evil).
So, to reduce the problem, either it must be made unprofitable to Google (that means court cases - good luck with that) or Joe Public must be educated as to what's going on and that it's all Google's fault (blame Bing as well if you like).
Copying data for any profitable purpose is likely to be a breach of copyright. A couple of test cases will be required to establish that this is true even if the data is not republished. However, we'll still be in civil law territory - to really make a difference, a precedent would need to be set establishing that scraping is a breach of criminal law. Personally, I would think this is doable, but I'm not a lawyer. However, international agreements would still be needed, and that's not going to happen quickly.
It's hard to see any way to defeat scrapers altogether by blocking - even if you come up with the perfect piece of software, the potential currently exists to use botnets, and defeating those will be really tricky. However, when the big boys are caught at it, naming and shaming might help.
|Flood protection would only slow down the most amateurish of the scrapers. Some of them tend to be smarter than that and may even have randomised timing to make them look like a human user. What often gives them away is their sharp difference from human browsing patterns. |
Flood protection still works. Even scrapers that avoid floods trip up in other ways, like stepping into a honeypot on the third link they crawl - something neither browsers nor the SEs do. There are other tells as well, such as when the scraper skews the average pages read per human visitor.
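A honeypot in this sense is roughly: a trap URL that's disallowed in robots.txt and linked invisibly to human visitors, so only a bot that ignores both ever requests it. A toy Python sketch - the path and handler are hypothetical, just to show the shape of it:

```python
# Illustrative honeypot: legit bots obey the Disallow, humans never see
# the hidden link, so anything hitting the trap path is a ripper.
TRAP_PATH = "/bait-do-not-follow/"   # hypothetical path

ROBOTS_TXT = f"User-agent: *\nDisallow: {TRAP_PATH}\n"

def hidden_link_html():
    # display:none keeps the link out of sight for human visitors.
    return f'<a href="{TRAP_PATH}" style="display:none" rel="nofollow">.</a>'

_banned = set()

def handle_request(ip, path):
    """Toy dispatcher: return an HTTP status code for this hit."""
    if path.startswith(TRAP_PATH):
        _banned.add(ip)      # stepped in the honeypot: block from now on
        return 403
    if ip in _banned:
        return 403
    return 200
```

In a real setup you'd persist the ban list and rotate the trap path so scrapers can't hard-code around it.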
|It's hard to see any way to defeat scrapers altogether by blocking - even if you come up with the perfect piece of software, the potential currently exists to use botnets and defeating them will be real tricky. However, when the big boys are caught at it, naming and shaming might help. |
Blocking all data centers is a big start.
Some botnets like "80legs" identify themselves, no problem there.
Many other botnets use fake UAs, fake headers, rotating UAs for the same IP (not common) or do other stupid things which make them pretty easy to spot most of the time.
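The rotating-UA tell is easy to check for. A minimal sketch (the threshold is a guess - real browsers do change UA on upgrades, so in practice you'd also expire old entries):

```python
from collections import defaultdict

# Illustrative tell: one IP presenting many distinct User-agent strings
# in a short span is almost never a real browser.
MAX_UAS_PER_IP = 3   # assumed threshold; tune for your own traffic

_seen_uas = defaultdict(set)

def looks_like_ua_rotator(ip, user_agent):
    """Flag an IP once it has shown more than MAX_UAS_PER_IP UA strings."""
    _seen_uas[ip].add(user_agent)
    return len(_seen_uas[ip]) > MAX_UAS_PER_IP
```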
I guess dynamic pages / CMS systems would have some sort of advantage here, e.g. blocking content after X requests per hour. Trouble is, some sites have too much traffic and the check itself might hit the server hard. Very interesting points of view here.
Someone in my forum read that article and came into our community forum to ask about it, and now a few people are riled up about their personal data. All of our forums are password-protected and all of our registrations are auto-checked against known spammers and baddies. That, plus flood protection, plus knowing how consistent our traffic and pageviews per user are all make me feel relatively comfortable.
I also keep telling people not to post anything they don't want people reading. That's the hardest part for my members to swallow ... they don't understand that in a forum, they're the ones putting the information out there.
Some info if you G for:
web content defense against data scrapers
From a scraper's POV, our biggest problem is the limited supply of cloud services and/or good services for shifting IP addresses geo-targeted at our country. Right now it's expensive and time-consuming for us to rent hundreds of IP addresses, get them blocked, run data through slow proxies, etc.
IPv6 will make scraping so much cheaper and easier for us. So our hope is that all webmasters are going to implement it ASAP. :)
I'm no technical expert - just wanted to give a heads up: IPv6 will make scraping a lot easier, for some scrapers at least. (Regarding scraping protection - not entirely off topic, I hope.)
Why scrape when you can bake? Anyone care for a supercookie?
New Web language HTML 5 can track users closely
By TANZINA VEGA, The New York Times via The Seattle Times [seattletimes.nwsource.com...]
I spend unrecompensed hours every single week fighting scrapers (and bots, and hackers, and whatevers) on my sites and in my browsers and machines. Hmm. I don't know which annoys me more -- that I've yet to dream up how to beat 'em or jump on the $craper-boom bandwagon with 'em:)
Good article here all about it:
The Intractable Screen Scraping Paradox
Jason Coombs is Director of Forensic Services for PivX Solutions Inc. (NASDAQ OTCBB: PIVX), a provider of security solutions, computer forensics, and expert witness services.
LSOs have been around for a while - and are forever. One reason I run FF with BetterPrivacy installed. At worst they can only get/share that session's info, because closing the browser clears it... AND I DO CLOSE MY BROWSER SEVERAL TIMES A DAY... you (meaning everyone) do too, right?
LSOs can only be installed when you allow Flash, so just be careful about when you run it.
|I FF with BetterPrivacy installed |
Add to it RequestPolicy, ABP, NoScript, and CookieSafe just to be on the safe side. And Live HTTP Headers to set up whatever headers you want on the client side. That's why I'm always against relying on headers to judge what someone is trying to do.
Unfortunately, what Pontus_swe said is true. IPv6 will make databases of IPs much harder to manage and follow, while at the same time resources will become cheaper for scraping.
|RequestPolicy, ABP, noscript, cookiesafe just to be on the safe side |
I'm not sure what any of this privacy stuff has to do with scrapers. Nothing. Off Topic.
Scrapers are the bottom feeding scum sucking underbelly of the internet that have absolutely no regard for COPYRIGHT and will do anything to make their income by stealing your income.
Scrapers do evil, there is no other way around it.
Not only that, you won't hide behind IPv6, because blocking entire hosting companies and cloud services will shut down scrapers just as effectively as it does with IPv4.
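That's the key point: you block allocations, not individual addresses, and that works identically for IPv4 and IPv6. A small Python sketch using the standard `ipaddress` module (the example ranges below are documentation prefixes, not any real hosting company's):

```python
import ipaddress

# Illustrative data-center blocklist: block a hosting company's whole
# allocation. The ranges here are made-up documentation prefixes.
BLOCKED_NETS = [
    ipaddress.ip_network("203.0.113.0/24"),   # example IPv4 allocation
    ipaddress.ip_network("2001:db8::/32"),    # example IPv6 allocation
]

def is_blocked(ip):
    """True if the address falls inside any blocked allocation."""
    addr = ipaddress.ip_address(ip)
    return any(addr in net for net in BLOCKED_NETS)
```

A scraper renting a fresh IPv6 address from the same cloud provider still lands inside the same /32, so the block holds.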
Scrapers can run but they CANNOT HIDE!
... for long :)
Mainly because scrapers don't realize the content they scrape often identifies the sources of the scraping - honeypots - which help bot blockers ID and block the source.
|I'm not sure what any of this privacy stuff has to do with scrapers. Nothing. Off Topic. |
It is related to the methods webmasters deploy to identify humans vs. bots and decide who is scraping. When browser filters are in use, you can't reliably tell what's going on. Just because you block an IP when the UA is empty doesn't mean there's a scraper behind it. Or someone triggers a honeypot because he uses a filter like the ones mentioned above to strip out resources - then the honeypot links become visible and can be accessed.
So my opinion is that without manual examination, scraping-identification methods may fail. And they can only hurt site owners, not scrapers, because scrapers likely use compromised systems anyway.
So privacy can complicate scraping identification methods.
|It is related with methods webmasters deploy to identify humans vs bots and decide upon scraping |
It's not related at all.
Scraping is about stealing from a website, there's no privacy issues involved in scraping.
It's more DMCA related.