Then I looked at the detailed data in the logs and realized that somebody has been, over the past 3 days, systematically downloading every single page from my site using some sort of automated program that lets you spoof the referrer and the user-agent. The fact that his IP traces to a DSL account, together with that ridiculous referrer, means that it's certainly not a search engine bot.
I quickly denied his IP with a .htaccess file, but the damage has already been done. I have no idea how he is going to use my content, but most likely it will end up on a splog or some spammy website that will only dilute my Google rankings.
From now on I am going to check my server logs every day, and anyone (except a bot) who downloads an inhuman number of pages without registering on my other trackers will get his IP denied. Of course, this guy could have just as easily masqueraded as GoogleBot and I would have been none the wiser.
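For anyone who wants to automate the daily log check, here is a rough sketch in Python. It assumes the access log is in the usual Apache common/combined format; the function name `heavy_downloaders` and the `max_pages` cutoff of 500 are made up for illustration, not something from my actual setup. It only counts requests per IP, so it does not exclude real search engine bots or cross-check against other trackers.

```python
import re
from collections import Counter

# Start of an Apache common/combined log line: client IP, identity, user,
# timestamp in brackets, then the quoted request line.
LOG_LINE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]*\] "(?:GET|POST|HEAD) (\S+)')

def heavy_downloaders(lines, max_pages=500):
    """Return {ip: request_count} for IPs with more than max_pages requests.

    max_pages is an arbitrary, hypothetical cutoff; tune it to your traffic.
    """
    counts = Counter()
    for line in lines:
        match = LOG_LINE.match(line)
        if match:
            counts[match.group(1)] += 1
    return {ip: n for ip, n in counts.items() if n > max_pages}
```

You would feed it the lines of one day's log file (for example, `heavy_downloaders(open(path))` with `path` pointing at your own log) and then review the resulting IPs by hand before denying anything.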
I also plan to sprinkle my content with hidden JavaScript and images so that it could "phone home" if it ends up somewhere else. Of course anyone determined enough could simply parse those things out.
Any other ideas?
I took the idea one step further. I set up a bunch of spider trap URLs (indistinguishable from my actual URLs) using RewriteRules that all lead to the same spider trap PHP page. Once the same IP has fallen into the trap 10 times, I send an email to myself.
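The counting logic behind the trap page is simple. Below is a minimal Python sketch of the idea, not the PHP I actually run: the class name `SpiderTrap` and the `notify` callback are hypothetical, the counts live in memory only, and a real version would persist them and send an email from `notify`.

```python
from collections import defaultdict

class SpiderTrap:
    """Counts trap-URL hits per IP and flags an IP once it reaches the threshold."""

    def __init__(self, threshold=10, notify=print):
        self.threshold = threshold   # hits needed before an IP is flagged
        self.notify = notify         # called once per flagged IP (e.g. send an email)
        self.hits = defaultdict(int)
        self.flagged = set()

    def record_hit(self, ip):
        """Register one trap hit; return True if the IP is now flagged."""
        self.hits[ip] += 1
        if self.hits[ip] >= self.threshold and ip not in self.flagged:
            self.flagged.add(ip)
            self.notify(ip)
        return ip in self.flagged
```

The trap page would call `record_hit` with the visitor's address on every request. Because `notify` fires only the first time an IP crosses the threshold, you get one alert per offender rather than one per hit.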
Then, instead of doing a deny, I keep everything the way it was, except that I begin replacing all my content with articles about Santa Claus, the Easter Bunny or Chewbacca (chosen randomly) for that IP. This way, by the time he realizes what happened, he will have downloaded hundreds of megabytes of garbage. I have almost unlimited bandwidth, so this is not a problem for me.
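The content swap itself can be sketched in a few lines of Python. This is only an illustration under assumptions: `choose_body`, `flagged_ips` and `decoys` are hypothetical names, and the decoy articles are assumed to be already loaded as strings.

```python
import random

def choose_body(ip, real_body, flagged_ips, decoys, rng=random):
    """Return a randomly chosen decoy article for flagged IPs, the real page otherwise.

    flagged_ips: set of IP strings that tripped the spider trap.
    decoys: non-empty list of decoy article texts.
    """
    if ip in flagged_ips:
        return rng.choice(decoys)
    return real_body
```

Everyone else keeps getting the real page, and the scraper still gets a normal-looking response at every URL, so nothing tips him off the way a deny would.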