You can't escape that.
What you can do is monitor. Monitor incoming traffic.
We have a forum for that. Start here: [webmasterworld.com...] . Create a table in your DB, add ranges, and block them.
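A minimal sketch of that table, assuming SQLite and Python's stdlib `ipaddress` module - the table and column names here are just examples, not anything standard:

```python
# Sketch of a block-range table, using SQLite and the stdlib
# ipaddress module; table/column names are made up for illustration.
import ipaddress
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE blocked_ranges (cidr TEXT PRIMARY KEY, note TEXT)")
conn.execute("INSERT INTO blocked_ranges VALUES (?, ?)",
             ("192.0.2.0/24", "scraper hosting range"))

def is_blocked(ip: str) -> bool:
    """True if ip falls inside any stored CIDR range."""
    addr = ipaddress.ip_address(ip)
    return any(addr in ipaddress.ip_network(cidr)
               for (cidr,) in conn.execute("SELECT cidr FROM blocked_ranges"))
```

In production you'd cache the ranges in memory rather than scanning the table on every request.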
What you can do is, on a weekly basis, search for strings you already see duplicated in Gorg Search, get the IP of the domain where the dupe content is hosted (btw, the scraper usually does not scrape from that IP, but if the hosting provider allows scraped content to be hosted... well, it is a hosting range), get the IP range from that, and block it - 2-3 hours a week, no more.
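Once you have the hosting domain, getting an IP range to block can look something like this - note I just assume a /24 here for illustration; a real check would pull the provider's actual allocation from whois:

```python
# Resolve the hosting domain and approximate its range as a /24.
# A real workflow would check whois for the provider's real allocation.
import ipaddress
import socket

def ip_to_range(ip: str) -> ipaddress.IPv4Network:
    # strict=False lets us pass a host address and get its network
    return ipaddress.ip_network(ip + "/24", strict=False)

def domain_to_range(domain: str) -> ipaddress.IPv4Network:
    return ip_to_range(socket.gethostbyname(domain))
```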
If you are ECom, block countries that you don't deal with.
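A tiny sketch of a country allow-list check - the country code for an incoming IP would come from a GeoIP database lookup (MaxMind or similar), and the list here is made up:

```python
# Illustrative only: allow-list of countries you actually serve.
# The country code would come from a GeoIP lookup on the client IP.
ALLOWED_COUNTRIES = {"US", "CA", "GB"}

def should_block_country(country_code: str) -> bool:
    return country_code not in ALLOWED_COUNTRIES
```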
Bad headers, bad UAs, fake referrer pages, bot traps.
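A rough sketch of that kind of screening - the UA substrings and header checks here are only examples, tune them against your own logs:

```python
# Rough request screening: empty or known-bad UA strings, or missing
# headers that real browsers normally send. Lists are examples only.
BAD_UA_SUBSTRINGS = ("curl", "python-requests", "scrapy", "java/")

def looks_like_bot(headers: dict) -> bool:
    ua = headers.get("User-Agent", "").lower()
    if not ua or any(s in ua for s in BAD_UA_SUBSTRINGS):
        return True
    if "Accept" not in headers:  # real browsers send an Accept header
        return True
    return False
```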
They want robots.txt? Sure, but nothing else from that last octet for a while until you verify it, sorry... etc., etc.
If you can do an RDNS lookup on the incoming IP and the result has one of the following in it:
.amazonaws.com
.your-server.de
.server4you.net
.hosteurope.de
.softlayer.com
.theplanet.com
.ovh.net
.xlhost.com
.serverloft.com
.fastwebserver.de
403, but not just a plain 403 - record and capture all the info you can, and learn from it.
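A sketch of that RDNS suffix check in Python - `gethostbyaddr` does the reverse lookup, and the suffix tuple is just the list above:

```python
# RDNS check against the hosting-provider suffixes listed above.
import socket

HOSTING_SUFFIXES = (
    ".amazonaws.com", ".your-server.de", ".server4you.net",
    ".hosteurope.de", ".softlayer.com", ".theplanet.com",
    ".ovh.net", ".xlhost.com", ".serverloft.com", ".fastwebserver.de",
)

def is_hosting_name(rdns: str) -> bool:
    """True if a reverse-DNS name ends with a known hosting suffix."""
    return rdns.lower().endswith(HOSTING_SUFFIXES)

def from_hosting_range(ip: str) -> bool:
    """Reverse-resolve ip and test the name (does a network lookup)."""
    try:
        name = socket.gethostbyaddr(ip)[0]
    except OSError:
        return False  # no PTR record; decide separately how to treat that
    return is_hosting_name(name)
```

Remember RDNS can be spoofed by whoever controls the PTR record, so treat a match as a strong signal, not proof, and verify before whitelisting anything.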
btw, this does not have to be your main domain :)
Point is, if you rank, then they will try to scrape. The more hosting ranges you've got, the better your defense is. The more you know your internal linking structure, the better you will be able to say 403. The better you understand how normal browsers work... you get the point, I hope.
Think about it this way: where is the weakest point in your defense if you had to scrape it yourself?
Blend27