I'd consider blocking grabbers from some sites as they can use a lot of bandwidth indiscriminately and they help people copy all your content easily. Of course it depends on what your site is for: there are some kinds of sites (e.g. reference sites for hobbyists) that you might WANT people to be able to copy to their local machines.
If I want to block one bad browser listed as a grabber, can I do it in a robots.txt file, or should I specify it in the .htaccess file?
Before you start losing sleep, bear in mind that the "user-agent" or identifying string can usually be changed. For example, our own robot, which we use to locate useful web sites, declares itself as what it is, but it could just as easily claim to be Internet Explorer, Opera or whatever. The same is true of commercial web site copying software: it usually has built-in methods for bypassing your attempts to block it. I'm not saying this is good, just that there's not much you can do about it.
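To illustrate how little effort the change takes, here is a minimal Python sketch that builds two requests for the same page with different identifying strings. The URL, the two user-agent strings and the helper name build_request are made up for the example, and nothing is actually sent over the network.

```python
from urllib.request import Request


def build_request(url, user_agent):
    """Return an unsent Request that carries the given User-Agent header.

    build_request is a hypothetical helper for this illustration only.
    """
    return Request(url, headers={"User-Agent": user_agent})


# An honest client and one that claims to be a desktop browser.
# Both strings are illustrative values, not taken from any real log.
honest = build_request("http://example.com/", "Wget/1.21")
spoofed = build_request(
    "http://example.com/",
    "Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
)

# urllib stores header names capitalized as "User-agent".
print(honest.get_header("User-agent"))
print(spoofed.get_header("User-agent"))
```

From the server's side the second request looks like any browser, which is why a user-agent block only stops clients that identify themselves honestly.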
GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc.
It is an incredibly useful tool, but if it is consistently being (mis)used by a third party to download your site then you should attempt to block it. In spidering mode it obeys robots.txt (User-agent: Wget) unless the user explicitly disables the option. The user can also change the user-agent identification string, so a really determined person can still rip your site with Wget even if you block the string "wget" in .htaccess. In these cases, you have to monitor IP addresses too.
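As a rough check of what such a robots.txt entry would mean for a well-behaved client, the sketch below feeds a two-line robots.txt to Python's standard urllib.robotparser. The robots.txt content, the example URL and the helper name allowed are assumptions for illustration; it says nothing about a client that ignores robots.txt.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content: shut out Wget, say nothing about anyone else.
ROBOTS_TXT = """\
User-agent: Wget
Disallow: /
"""


def allowed(user_agent, url, robots_txt=ROBOTS_TXT):
    """Return True if robots_txt permits user_agent to fetch url.

    allowed is a hypothetical helper; it parses the text in memory
    instead of downloading a real robots.txt file.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


print(allowed("Wget/1.21", "http://example.com/page.html"))
print(allowed("Mozilla/5.0", "http://example.com/page.html"))
```

The rule only has an effect when the client both reads robots.txt and reports itself as Wget, so it is a polite request rather than an actual block.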
SetEnvIfNoCase User-Agent "wget" bad_bot
Order Allow,Deny
Allow from all
Deny from env=bad_bot
but do some reading on the Apache forum first, in case I have the User agent wrong or have missed some syntax.
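If you want to sanity-check which identifying strings a case-insensitive "wget" pattern would catch before touching the server, something like this Python sketch mimics the matching step only. It is not Apache's implementation, and the name is_bad_bot and the sample strings are invented for the example.

```python
import re

# Case-insensitive substring match, roughly what a SetEnvIfNoCase
# rule on the pattern "wget" would test against the User-Agent header.
BAD_BOT = re.compile(r"wget", re.IGNORECASE)


def is_bad_bot(user_agent):
    """Return True if the user-agent string contains "wget" in any case.

    is_bad_bot is a hypothetical helper, not part of Apache or any library.
    """
    return bool(BAD_BOT.search(user_agent or ""))


# Illustrative sample strings; the last one is a missing header.
for ua in ["Wget/1.21", "WGET", "Mozilla/5.0 (compatible)", ""]:
    print(repr(ua), is_bad_bot(ua))
```

Anything that does not contain the string, including an empty or rewritten user agent, falls through, which is where watching IP addresses comes in.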