helleborine -- a simple way to strip referer is to deliver a link via an onclick event, and there are numerous other ways to make the referer wrong if you are serving the referring page. While you can try to block on referer, it will fail for the most determined sites ... probably the ones you want to block most. It takes more effort to spoof IP addresses, but it's pretty easy, especially through proxies, and now you can get a new IP from the cloud in a matter of minutes. In short, another losing battle.
There are two approaches I have tried, for various sites with content that people like to copy.
The first took me a couple of days to code, and a couple of months of refinements every so often, with quarterly touchups. By scanning a 5- to 15-minute stream of requests, it's quite easy to pick out the bots. Typically a given burst of requests will share the same IP, or user agent, or lack of referer (or same referer) -- a few common patterns. My code kept the most recent "suspects" in memory, with counters incremented every time a given instance of a pattern was recognized.
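For what it's worth, here's a rough sketch of that kind of suspect-tracking loop in Python (not my original code; the field names and thresholds are just assumptions):

    import time
    from collections import defaultdict

    WINDOW = 10 * 60     # roughly the 5-15 minute scan window
    THRESHOLD = 100      # hits per window before a pattern becomes a "suspect"

    # hit timestamps keyed by (pattern_type, pattern_value), e.g. ("ip", "203.0.113.9")
    hits = defaultdict(list)

    def record(request):
        """Tally the patterns this request exhibits; 'request' is assumed to be a
        dict with 'ip', 'user_agent' and 'referer' keys."""
        now = time.time()
        patterns = [
            ("ip", request["ip"]),
            ("ua", request["user_agent"]),
            ("referer", request.get("referer") or "MISSING"),
        ]
        suspects = []
        for key in patterns:
            # keep only hits still inside the window, then add this one
            hits[key] = [t for t in hits[key] if now - t < WINDOW] + [now]
            if len(hits[key]) > THRESHOLD:
                suspects.append(key)
        return suspects   # non-empty means: start serving this client the special pages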
Suspects would get a custom page: one with a honeypot (for crawlers), one with content that only a determined bot could tell was not meant for human consumption, or some suitably interdependent javascript that could easily confuse them. Any suspect who failed one or more of the tests would start getting 503's or (my favorite) a really slow response. Often that was enough to discourage them. Those who went away quickly enough had their details recorded, but were taken off the suspects list.
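The slow-response/503 part is trivial; something along these lines (a minimal WSGI-flavored sketch, assuming a failed_tests counter kept elsewhere -- not the code we actually ran):

    import time

    def punish(environ, start_response, failed_tests):
        """Hypothetical handler for suspects who failed one or more of the tests."""
        if failed_tests >= 3:
            # persistent offenders get an outright 503
            start_response("503 Service Unavailable", [("Retry-After", "3600")])
            return [b"Service temporarily unavailable.\n"]
        # otherwise, the really slow response: dribble out a useless page
        start_response("200 OK", [("Content-Type", "text/html")])
        def dribble():
            for chunk in [b"<html><body>"] + [b"<p>...</p>"] * 20 + [b"</body></html>"]:
                time.sleep(5)   # a human would have given up long ago
                yield chunk
        return dribble()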
Repeat offenders got put in a "quick 403" list, which we would review as new entries came in (which wasn't that often, actually). This was just a simple map that Apache used to block on whatever pattern they had shown. The list recorded the original date, the most recent offense, and the number of offenses.
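Maintaining that list is a few lines of bookkeeping plus a flat file for Apache to read -- e.g. the key/value format a txt RewriteMap expects (a sketch; the file names and the "deny" value are assumptions):

    import json, time

    BLOCKLIST = "quick403.json"      # our bookkeeping: first seen, last offense, count
    APACHE_MAP = "quick403_map.txt"  # the flat map Apache reads

    def add_offender(pattern):
        """Record an offense against a pattern (an IP, a UA string, etc.) and
        regenerate the map used for the quick 403s."""
        try:
            with open(BLOCKLIST) as f:
                db = json.load(f)
        except FileNotFoundError:
            db = {}
        entry = db.setdefault(pattern, {"first_seen": time.time(), "offenses": 0})
        entry["offenses"] += 1
        entry["last_offense"] = time.time()
        with open(BLOCKLIST, "w") as f:
            json.dump(db, f, indent=2)
        # one "key value" line per offender
        with open(APACHE_MAP, "w") as f:
            for key in sorted(db):
                f.write(f"{key} deny\n")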
That was enough for our needs, and it probably blocked 95% of the unruly bots, either by giving them bogus content or just by making them go away quickly. The total number of hits on expensive content dropped dramatically, and occasions when enough mal-bots hit our sites simultaneously to cause capacity or performance problems were effectively eliminated.
We set up a simple Google Alert for the special bogus content we had delivered, and regularly found the sites that had ripped us off. We either reported them to Google or, in several cases, were able to deliver content that would certainly get them penalized by Google and other SEs. I would have automated that process if I'd had the chance.
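If I were automating it today, the first step would be making the bogus content traceable: give each suspect a unique marker string, log it, and then the alert (or a periodic search) on that marker tells you exactly whose copy got scraped. A minimal sketch (the secret, file name, and token format are all just assumptions):

    import hashlib, time

    SECRET = "rotate-me-quarterly"   # any private salt will do

    def watermark(suspect_key):
        """Produce a unique, innocuous-looking token to embed in the bogus page
        served to this suspect, and log it so a search hit can be traced back."""
        token = hashlib.sha1(f"{SECRET}:{suspect_key}".encode()).hexdigest()[:12]
        with open("watermarks.log", "a") as f:
            f.write(f"{time.time():.0f}\t{suspect_key}\t{token}\n")
        # slip the token into the page as a plausible-looking product code or footnote
        return f"ref-{token}"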
This was some of the most rewarding software I have ever written. We did some similar stuff with the form-bots and our email server was not a friendly place for those who wanted to hack us. There's little that is more gratifying than taking down an attacker by deflecting their punches and using their own power against them -- malware kung fu :-)
But then I went to another company and found that our needs were different. In that case, we just added enough servers to handle the load the bots put on us, and trusted Google to do the right thing, which they almost always have. When a page we cared about suddenly changed, we manually checked to see how many copies of it there were on the web, and hired an intern to rat on the obvious ones to Google. It's less automated, but it only comes up once or twice a month, and it requires no special code.
I was thinking about using Amazon Mechanical Turk to do the human part of these tasks, but then realized that's probably how the bots are getting so clever at getting past automated or heuristic prevention techniques. And this is probably the most alarming aspect of the whole thing: people are getting paid pennies to find ways to spam your site, and once you have people using real browsers to outwit your automated defenses, it's game over. People are always smarter and harder to detect.
I recommend the second approach, unless you have good reason to go for the first.
Tom