Forum Moderators: phranque
I am running a proxy service.
If someone links to a file on my proxy, a bot may follow every link and spider a second version of the internet through my proxy. I don't think I have that kind of bandwidth!
Nor would I like to have that much duplicate-content penalisation!
For those spiders that identify themselves properly using the HTTP User-agent request header, this doesn't sound like a very difficult project. You could use the RewriteCond directive of mod_rewrite, testing the server variable %{HTTP_USER_AGENT}, and forbid access to all files requested by the robots you want to exclude.
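A minimal sketch, assuming the robots announce themselves as "badbot" and "evilspider" (placeholder names; substitute the User-agent strings you actually see in your access logs):

    RewriteEngine On
    # Match the offending User-agent strings, case-insensitively
    RewriteCond %{HTTP_USER_AGENT} badbot [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} evilspider [NC]
    # Return 403-Forbidden for any request from those robots
    RewriteRule .* - [F]

The [OR] flag chains the conditions, so you can add one RewriteCond per robot and finish the list with a condition that has no [OR].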
For stealth 'bots that claim to be browsers, you'll need to collect a list of their IP address ranges, and exclude them by IP address, also using RewriteCond, but with server variable %{REMOTE_ADDR}.
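Along the same lines, assuming you've traced a stealth 'bot to the 192.0.2.* range (a documentation range used here only as a placeholder; use the ranges from your own logs):

    RewriteEngine On
    # Deny any request originating from the placeholder address range
    RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
    RewriteRule .* - [F]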
For more information, see the documents cited in our forum charter [webmasterworld.com] and the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com].
You could also use a combination of mod_setenvif and mod_access if mod_rewrite is not available to you, but it's a bit less straightforward.
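A rough equivalent of the examples above using that approach, with the same placeholder names and address range:

    # mod_setenvif sets a variable when the User-agent or address matches
    SetEnvIfNoCase User-Agent badbot block_bot
    SetEnvIf Remote_Addr ^192\.0\.2\. block_bot
    # mod_access then denies any request carrying that variable
    Order Allow,Deny
    Allow from all
    Deny from env=block_bot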
Jim