Forum Moderators: phranque

Blocking Bots / Scrapers - Possible?

         

SerpsGuy

8:53 am on Oct 1, 2014 (gmt 0)

10+ Year Member



I have dealt with scrapers stealing my content for years. Despite what Google says, stealing content does work, and it can hurt the legitimate content owner.

I was sharing this with a buddy of mine today, and he said, "Isn't there a way to block everyone but Bing, Google, and Yahoo?"

I don't know. Is there a way to do that? I also get traffic from DuckDuckGo. Is there any way to block all bots except the big three and DuckDuckGo?

Additionally, if anyone has any good methods for blocking scrapers in general, I will read whatever you share and thank you for the tips.

phranque

9:24 am on Oct 1, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



it's an "arms race".
you can exclude well-behaved bots from crawling URLs using the robots.txt protocol.
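A rough sketch of what that could look like for the "big three plus DuckDuckGo" case, checked with Python's standard `urllib.robotparser`. The user agent tokens (`Googlebot`, `bingbot`, `Slurp`, `DuckDuckBot`), the test URL, and `SomeScraper` are assumptions for illustration, so confirm the tokens against each search engine's current documentation. This only affects bots that choose to obey robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: four named crawlers may fetch everything,
# every other (well-behaved) bot is excluded from the whole site.
ROBOTS_TXT = """\
User-agent: Googlebot
User-agent: bingbot
User-agent: Slurp
User-agent: DuckDuckBot
Disallow:

User-agent: *
Disallow: /
"""


def build_parser(text: str) -> RobotFileParser:
    """Parse robots.txt content from a string instead of fetching it."""
    parser = RobotFileParser()
    parser.parse(text.splitlines())
    return parser


parser = build_parser(ROBOTS_TXT)

# "SomeScraper" stands in for any bot not named in the file.
for agent in ("Googlebot", "bingbot", "Slurp", "DuckDuckBot", "SomeScraper"):
    print(agent, parser.can_fetch(agent, "/articles/page.html"))
```

An empty `Disallow:` means "nothing is off limits" for the named group, while `Disallow: /` under `User-agent: *` excludes everyone else.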
you can use the RewriteRule directive's [F] flag or the Deny directive to Forbid (with a 403 status code) recognized scraper user agents and IP addresses that request excluded URLs.
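In Apache this is normally done in the server config or .htaccess (for example a RewriteCond on %{HTTP_USER_AGENT} followed by a RewriteRule with [F]). The decision itself is simple, and the Python sketch below only illustrates that logic; it is not Apache configuration. The deny-list entries and the `status_for` helper are hypothetical, the IP range is a reserved documentation range, and for brevity the sketch ignores which URL was requested.

```python
import ipaddress

# Hypothetical deny lists; real entries would come from your own access logs.
BLOCKED_AGENT_SUBSTRINGS = ("badscraper", "contentgrabber")
BLOCKED_NETWORKS = [ipaddress.ip_network("203.0.113.0/24")]  # documentation range


def status_for(user_agent: str, remote_ip: str) -> int:
    """Return 403 for deny-listed agents or addresses, otherwise 200."""
    ua = user_agent.lower()
    # Case-insensitive substring match on the User-Agent header.
    if any(fragment in ua for fragment in BLOCKED_AGENT_SUBSTRINGS):
        return 403
    # Membership test of the client address against each blocked range.
    addr = ipaddress.ip_address(remote_ip)
    if any(addr in network for network in BLOCKED_NETWORKS):
        return 403
    return 200


print(status_for("BadScraper/1.0", "198.51.100.7"))
print(status_for("Mozilla/5.0", "203.0.113.9"))
print(status_for("Mozilla/5.0", "198.51.100.7"))
```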
once a scraper is willing to use stealth to look like a legitimate visitor, you have to take more sophisticated measures to recognize and block scrapers without collateral damage.

not2easy

1:22 pm on Oct 1, 2014 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



There is no copy-and-paste fix because things change and evolve, and not everyone's server environment is the same. The simplest way to block robots depends on them showing their credentials at the door (the user agent string they send), which is very easy to fake.
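One common way to check a visitor that claims to be a major crawler is a forward-confirmed reverse DNS lookup: resolve the IP to a hostname, check the hostname's domain, then resolve that hostname back and confirm it returns the same IP. The sketch below is a minimal version of that idea, not a complete solution. The suffix list and the function names are assumptions, so check each search engine's own verification documentation, and note that some crawlers publish IP lists instead.

```python
import socket
from typing import Callable, List

# Assumed hostname suffixes for verified crawlers; confirm against each
# search engine's current documentation before relying on them.
TRUSTED_SUFFIXES = (".googlebot.com", ".google.com", ".search.msn.com")


def reverse_lookup(ip: str) -> str:
    """PTR lookup: IP address to hostname."""
    return socket.gethostbyaddr(ip)[0]


def forward_lookup(host: str) -> List[str]:
    """A lookup: hostname to list of IPv4 addresses."""
    return socket.gethostbyname_ex(host)[2]


def is_verified_crawler(
    ip: str,
    reverse: Callable[[str], str] = reverse_lookup,
    forward: Callable[[str], List[str]] = forward_lookup,
) -> bool:
    """True only if the PTR name is in a trusted domain AND resolves back to ip."""
    try:
        host = reverse(ip).lower().rstrip(".")
        if not host.endswith(TRUSTED_SUFFIXES):
            return False
        return ip in forward(host)
    except OSError:
        # socket.herror and socket.gaierror are OSError subclasses;
        # a failed lookup is treated as "not verified".
        return False
```

The resolver functions are passed in as parameters so the check can be exercised without touching the network. In real use you would cache the results, since DNS lookups on every request are slow.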

You need to spend time or money analyzing your access logs and learning to identify unwanted behavior patterns. It helps to learn the difference between residential ISP IP addresses and those belonging to hosting providers (or server farms), and to keep tabs on the unwanted IPs as they turn up. After a while you should have your own database to help keep track of them all.
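As a starting point for that kind of log analysis, here is a minimal sketch that counts requests per IP and lists the heaviest hitters. It assumes the Apache "combined" log format; the sample lines, the threshold, and the function names are made up for illustration. Request volume alone does not prove an IP is a scraper, it only tells you where to look first.

```python
import re
from collections import Counter

# Matches an Apache "combined" log line:
# ip ident user [time] "request" status size "referer" "user agent"
LOG_PATTERN = re.compile(
    r'^(?P<ip>\S+) \S+ \S+ \[[^\]]+\] "(?P<request>[^"]*)" '
    r'(?P<status>\d{3}) \S+ "(?P<referer>[^"]*)" "(?P<agent>[^"]*)"'
)


def requests_per_ip(lines):
    """Count parsed requests per client IP; unparseable lines are skipped."""
    counts = Counter()
    for line in lines:
        match = LOG_PATTERN.match(line)
        if match:
            counts[match.group("ip")] += 1
    return counts


def heavy_hitters(lines, threshold):
    """Return (ip, count) pairs with at least `threshold` requests, busiest first."""
    return [(ip, n) for ip, n in requests_per_ip(lines).most_common() if n >= threshold]


# Made-up sample lines using reserved documentation IP ranges.
SAMPLE = [
    '203.0.113.9 - - [01/Oct/2014:08:53:00 +0000] "GET /page1 HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [01/Oct/2014:08:53:01 +0000] "GET /page2 HTTP/1.1" 200 498 "-" "Mozilla/5.0"',
    '203.0.113.9 - - [01/Oct/2014:08:53:02 +0000] "GET /page3 HTTP/1.1" 200 530 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [01/Oct/2014:08:54:10 +0000] "GET /page1 HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
]

print(heavy_hitters(SAMPLE, threshold=3))
```

From here you could extend the counter to group by user agent or by network range, which is closer to the "own database" idea above.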

There are a few resources here that you can read through. The Apache Documentation pages (found in this forum's Charter) and the Library (the link is right next to the Charter) offer many discussions on the topic to help you learn some techniques. The SESpiders forum, [webmasterworld.com...], helps you start keeping up with the IP blocks. To be effective, this has to be an ongoing process.