
Search Engine Spider and User Agent Identification Forum

    
Can CloudFlare stop content scrapers?
Andy500
msg:4662878
1:38 pm on Apr 14, 2014 (gmt 0)

I would like to ask whether anybody here has any experience with CloudFlare successfully blocking content scrapers (i.e. bots), or at least helping to reduce the number of them accessing a site and stealing its content.

Web scrapers are always evolving, so maintaining an up-to-date database would be a chore, but their website seems to claim that they block the "bad bots". Likewise, identifying a content scraper is not easy, although CloudFlare appears to advertise the service as being able to identify bad bots even when they are not officially listed as such. CloudFlare also charges a hefty amount for their Business plan and beyond, so one would expect them to have a solid plan against content scrapers, aside from their ScrapeShield option, where they simply attempt to track any scraped content.

Does anybody use CloudFlare and have experience with this specific issue? Or can anyone make an educated guess? FWIW, I have heard from them, and they claim to be able to block content scrapers, but not all; however, I have yet to hear anyone's first-hand experience on this specific issue (at least from reliable, non-biased sources).

It'd be great to hear some replies as I am sure this topic would benefit others too.

Many thanks!

Andy

 

incrediBILL
msg:4662900
3:00 pm on Apr 14, 2014 (gmt 0)

I have heard from them and they claim to be able to block content scrapers but not all


Nobody can block ALL.

Heck, I can give you a single server field to check that will let you block over 80% which is a LOT but not all. Blocking all is impossible but censoring input from data centers around the globe and potentially blocking about a dozen countries that cause most of the scraping will get you damn close.
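
I'm not going to spell out the exact field here, but purely as an illustration, a crude .htaccess rule keyed on the User-Agent header (one obvious candidate, assuming mod_rewrite is enabled; the patterns are examples only) would look something like this:

  RewriteEngine On
  # Refuse requests with an empty User-Agent, or one naming a common scraping tool.
  RewriteCond %{HTTP_USER_AGENT} ^$ [OR]
  RewriteCond %{HTTP_USER_AGENT} (curl|wget|libwww|python) [NC]
  RewriteRule .* - [F]

Anything that fakes a browser User-Agent sails straight past a rule like that, which is why the data center and country angle matters so much.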

I've never had an account there to really put it to the test myself, but a client put up a test site for a few weeks and gave me access and it was just OK from what I saw.

However, it does something which is better than nothing but don't go in with high expectations.

You're better off getting a bot blocking script IMO, even if you use CloudFlare.

wilderness
msg:4662903
3:06 pm on Apr 14, 2014 (gmt 0)

See this thread [webmasterworld.com] for the perils of using CloudFlare (visitor IPs are not available in raw access logs).

Andy500
msg:4662927
5:12 pm on Apr 14, 2014 (gmt 0)

Nobody can block ALL.

Heck, I can give you a single server field to check that will let you block over 80% which is a LOT but not all. Blocking all is impossible but censoring input from data centers around the globe and potentially blocking about a dozen countries that cause most of the scraping will get you damn close.

I've never had an account there to really put it to the test myself, but a client put up a test site for a few weeks and gave me access and it was just OK from what I saw.

However, it does something which is better than nothing but don't go in with high expectations.

You're better off getting a bot blocking script IMO, even if you use CloudFlare.


Thanks for your reply. Yes, I understand that not all bad bots can be stopped, and I'm not even asking for that.

What interests me is what you mention in your 2nd paragraph. Is that (server farms and data centers) what CloudFlare would allow us to block? And can it be done manually, by adding some IP ranges and some rules to the .htaccess file? I don't know enough about server farms and data centers to relate them to how CloudFlare (or our server) works.
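
Just so I understand the manual route, is it roughly something like this in .htaccess? (The CIDR ranges below are documentation placeholders, not real data-center blocks.)

  # Deny a handful of placeholder ranges; everything else stays allowed.
  Order Allow,Deny
  Allow from all
  Deny from 192.0.2.0/24
  Deny from 198.51.100.0/24
  Deny from 203.0.113.0/24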

We really are only interested in stopping web scrapers, and we are also interested in blocking all countries where English is not the main tongue, since the English-speaking world is our market. As long as blocking all these countries doesn't hurt our site speed (and therefore our on-site SEO), we are happy to do whatever it takes to stop these scrapers. If that means blocking 90% of the world, we are happy to do it, either via CloudFlare (who have confirmed that blocking countries through them does not slow a site down) or via .htaccess. We believed .htaccess blocking would slow our site dramatically, but in another thread in this forum section a member mentioned that the impact on page speed is minimal, even when blocking huge sections of the world.
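
For the country side of it, the pattern I keep seeing in write-ups (assuming our host has a GeoIP module such as mod_geoip loaded, which I would have to confirm) is a rewrite rule keyed on the country code the module sets, something like:

  RewriteEngine On
  # GEOIP_COUNTRY_CODE is set by the GeoIP module; the codes below are
  # illustrative only -- we would list every country we decide to exclude.
  RewriteCond %{ENV:GEOIP_COUNTRY_CODE} ^(CN|RU)$
  RewriteRule .* - [F]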

The above leads me to the next question: are there any .htaccess rules or other server-side tricks to detect how a web scraper behaves when it visits a site? I am reading the thread on identifying visiting bots, but it looks like a complicated matter, since you can get into big trouble if you do it wrong, and it isn't just a matter of inserting lines into .htaccess (from what I can gather). My understanding was that CloudFlare simply detects a scraper (or bad bot) with an algorithm based on bot behaviour, which would certainly simplify the process for us (albeit for a good amount of money each month).
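
For example, one behavioural tell I've read about is that mainstream browsers always send an Accept header while many crude scrapers don't, so I imagine something like the sketch below could be a start (just a sketch, and I realise it could also hit legitimate clients, which is exactly the "big trouble" I'm worried about):

  RewriteEngine On
  # Reject requests that arrive with no Accept header at all,
  # while letting the major search engine crawlers through.
  RewriteCond %{HTTP_ACCEPT} ^$
  RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot) [NC]
  RewriteRule .* - [F]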

I would sincerely appreciate it if you could reply to my thoughts above, which boil down to:

1) Can we block 90% of the world, plus the IP ranges of shady server farms, with .htaccess, without a major impact on our site, its resources and its page loading time? The site we want to implement this on is a forum hosted on a VPS with a 30 GB SSD and 2 GB of RAM (please let me know if you need more of our VPS's specs). Currently, the site only consumes about 1/8 of its allotted resources (but it's growing fast in traffic).

2) Are there any other options for blocking scrapers aside from blocking countries and shady server farms with .htaccess? I am talking about the server side, since we are already using on-page measures to dissuade scrapers and make their job harder.

3) Considering what we need in 1) and 2) above, and the alternatives you or anyone else can see, is CloudFlare still worth it? OFC, this is something for our team to decide, but I'd appreciate any opinions on it. Surely, if we can do 1) and 2) just by modifying the .htaccess file and perhaps asking our webhost for a couple of extra things on the firewall, then CloudFlare would not be worth it, as right now its other benefits are not priorities for us.

Many thanks!


See this thread [webmasterworld.com] for the perils of using CloudFlare (visitor IPs are not available in raw access logs).


Thanks for that. But didn't CloudFlare already address the issue of not being able to see IPs with a certain module? I really cannot recall what it was called and have tried a quick Google search for its name; I believe it was mod_cloudflare (not Rocket Loader, which is their JavaScript-loading feature). What it does is pass the actual IPs of visitors through to the server instead of making everything look like it comes from CloudFlare.
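
From what I've read, a similar effect can also be had with Apache's stock mod_remoteip in the server config (not .htaccess): it restores the client address from the CF-Connecting-IP header that CloudFlare adds, but only for connections coming from CloudFlare's published proxy ranges. The two ranges below are just examples; the full, current list is on cloudflare.com/ips.

  # In httpd.conf / the vhost, with mod_remoteip loaded (Apache 2.4+):
  RemoteIPHeader CF-Connecting-IP
  RemoteIPTrustedProxy 173.245.48.0/20
  RemoteIPTrustedProxy 103.21.244.0/22
  # ...add the rest of CloudFlare's ranges, and use %a in LogFormat so the
  # restored address is what ends up in the access log.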

I am deeply researching CloudFlare, so thanks for the referenced thread. CloudFlare looks good on paper, but I have read that only their Business plan ($200/month) and above are really worth it for medium-trafficked sites (and by medium-trafficked, I mean anything with 5,000+ uniques daily, which in reality is nothing).

Thank you all for any replies!
