Nobody can block ALL.
Heck, I can give you a single server field to check that will let you block over 80%, which is a LOT but not all. Blocking all is impossible, but blocking traffic from data centers around the globe, and potentially blocking about a dozen countries that cause most of the scraping, will get you damn close.
I've never had an account there to really put it to the test myself, but a client put up a test site for a few weeks and gave me access and it was just OK from what I saw.
However, it does something, which is better than nothing, but don't go in with high expectations.
You're better off getting a bot blocking script IMO, even if you use CloudFlare.
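To illustrate the data center idea above, here is a minimal sketch of the range check a bot blocking script would do at its core. The ranges are placeholder documentation blocks, not real data center or country ranges, and `BLOCKED_RANGES` and `is_blocked` are made-up names for illustration only.

```python
import ipaddress

# Placeholder ranges (reserved documentation blocks), NOT real data center
# ranges. A real list would come from hosting providers or a GeoIP database.
BLOCKED_RANGES = [
    ipaddress.ip_network("192.0.2.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_blocked(ip: str) -> bool:
    """Return True if ip falls inside any blocked range."""
    try:
        addr = ipaddress.ip_address(ip)
    except ValueError:
        # Assumption: a malformed address is rejected rather than let through.
        return True
    return any(addr in net for net in BLOCKED_RANGES)


if __name__ == "__main__":
    for candidate in ("192.0.2.44", "203.0.113.9"):
        print(candidate, "blocked" if is_blocked(candidate) else "allowed")
```

A linear scan like this is fine for a few hundred ranges. With many thousands of ranges you would probably want a sorted list or a prefix tree instead.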
Thanks for your reply. Yes, I understand that not all bad bots can be stopped, and I don't even ask for that.
What interests me is what you mention in the 2nd paragraph. Is that (server farms and data centers) what CloudFlare would allow us to block? And can this be done manually by adding some IP ranges and some rules to the .htaccess file? I don't know enough about server farms and data centers to work out how they relate to how CloudFlare (or our server) works.
We are really only interested in stopping web scrapers, and we are also interested in blocking all non-English-speaking countries (where English is not the main language), since English-speaking countries are our market. So long as blocking all these countries doesn't affect our site speed and our on-site SEO (i.e. site speed again), we are happy to go with whatever it takes to stop these scrapers. If that means blocking 90% of the world, then we are happy to do that, either via CloudFlare (who have confirmed that blocking countries with CloudFlare does not slow a site down) or via .htaccess. We believed .htaccess would slow our site dramatically, but in another thread in this same forum section a member mentioned that the impact on page speed is minimal, even when blocking huge sections of the world.
The above leads me to the next question: are there any .htaccess rules or server-side tricks to detect how a web scraper behaves when it visits a site? I am reading the thread on recognizing the ID of visiting bots, but it looks like a complicated matter, as you can get in big trouble if you do it wrong, plus it isn't just about inserting lines in .htaccess (from what I can gather). My belief was that CloudFlare would simply detect a scraper (or bad bot) with an algorithm of theirs based on bot behavior, which would certainly simplify the process for us (albeit by paying a good amount of money each month).
I would sincerely appreciate it if you could reply to my thoughts above, which boil down to:
1) Can we block 90% of the world and other IP ranges of shady server farms with .htaccess without majorly impacting our site, its resources and its page loading time? The site we want to implement this on is a forum hosted on a VPS with a 30 GB SSD and 2 GB of RAM (please let me know if you need more of our VPS's specs). Currently, the site only consumes about 1/8 of its allotted resources (but its traffic is growing fast).
2) Are there any other options to block scrapers aside from blocking countries and shady server farms with .htaccess? I am talking about server side, since we are already controlling on-page factors to dissuade/make the job harder for scrapers.
3) Considering what we need in 1) and 2) above and the alternatives you or anyone else could see, is CloudFlare still worth it? OFC, this is something for our team to decide, but I'd appreciate any opinions on this. Surely, if we can do 1) and 2) by just modifying the .htaccess and perhaps asking our webhost for a couple of extra things on the firewall, then CloudFlare would not be worth it, as right now the other benefits of CloudFlare are not priorities for us.
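On question 2), one server-side option besides country and data center blocks is behavior-based: count requests per IP over a short window and flag anything faster than a human would browse. The following is only a rough sketch of that idea, not how CloudFlare or any particular blocking script works. `RateWatcher`, `looks_like_scraper` and the thresholds are hypothetical and would need tuning against real logs.

```python
import time
from collections import defaultdict, deque


class RateWatcher:
    """Flag IPs that make more than max_hits requests within window seconds."""

    def __init__(self, max_hits: int = 30, window: float = 10.0):
        # Assumed thresholds for illustration; tune them against real traffic.
        self.max_hits = max_hits
        self.window = window
        self.hits = defaultdict(deque)  # ip -> timestamps of recent requests

    def looks_like_scraper(self, ip: str, now: float = None) -> bool:
        """Record one request from ip and report whether it exceeds the limit."""
        now = time.monotonic() if now is None else now
        recent = self.hits[ip]
        recent.append(now)
        # Drop timestamps that have fallen out of the sliding window.
        while recent and now - recent[0] > self.window:
            recent.popleft()
        return len(recent) > self.max_hits


if __name__ == "__main__":
    watcher = RateWatcher(max_hits=3, window=10.0)
    for second in range(5):
        flagged = watcher.looks_like_scraper("203.0.113.5", now=float(second))
        print(second, flagged)
```

A check like this would need a whitelist for the search engine crawlers you want to keep, since they can also request pages quickly.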
See this thread [webmasterworld.com] for the perils of using CloudFlare (visitor IPs are not available in raw access logs).
Thanks for that. But didn't CloudFlare already address the issue of not being able to see IPs with a certain module? I really cannot recall what module it was and have tried a quick Google search for its name. I think it was Rocket Loader. What it does is return the actual IPs of visitors to the server, instead of everything looking like it comes from the CloudFlare firewall.
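For context, CloudFlare passes the original visitor address to the origin server in the CF-Connecting-IP request header, so the real IP can be recovered as long as the header is only trusted when the connection really comes from CloudFlare. A minimal sketch of that check follows. `TRUSTED_PROXIES` holds a placeholder range rather than CloudFlare's published ranges, and `real_client_ip` is a made-up helper name.

```python
import ipaddress

# Placeholder proxy range (a reserved documentation block). In practice this
# list would hold the IP ranges CloudFlare publishes for its edge servers.
TRUSTED_PROXIES = [ipaddress.ip_network("198.51.100.0/24")]


def real_client_ip(peer_ip: str, headers: dict) -> str:
    """Return the visitor IP, trusting CF-Connecting-IP only from known proxies.

    headers is assumed to be a plain dict with exact-case header names.
    """
    peer = ipaddress.ip_address(peer_ip)
    if any(peer in net for net in TRUSTED_PROXIES):
        forwarded = headers.get("CF-Connecting-IP", "").strip()
        try:
            # Normalise and validate the forwarded address before using it.
            return str(ipaddress.ip_address(forwarded))
        except ValueError:
            pass  # Missing or malformed header: fall back to the peer address.
    # Direct connections could forge the header, so it is ignored for them.
    return peer_ip


if __name__ == "__main__":
    print(real_client_ip("198.51.100.7", {"CF-Connecting-IP": "203.0.113.9"}))
    print(real_client_ip("192.0.2.1", {"CF-Connecting-IP": "203.0.113.9"}))
```

Ignoring the header on direct connections matters because otherwise a scraper could send a fake CF-Connecting-IP value and dodge any IP-based block.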
I am researching CloudFlare in depth, so thanks for the referenced thread. CloudFlare looks good on paper, but I have read that only their business plan ($200/month) and above are really worth it for medium-trafficked sites (and by medium-trafficked, I mean anything with 5,000+ uniques daily, which in reality is nothing).
Thank you all for any replies!