|Is a grabber a browser?|
are they good or bad?
I have a question, but don't know if it belongs here. I just started noticing that a grabber is consistently listed in the browser list on my stats page. "Wget" is the name of it. Are grabbers really browsers, and are they good? If not, is there any way to disallow one in, say, a robots.txt file or something? Forgive me if this is a dumb question, but I can't find the answer anywhere. Thanks.
This is a tool used to make an entire copy of a website (pages, graphics, everything) from a given web address.
I personally block these with great prejudice.
Your stats page says "browsers" but what it really means is "user agents", which is the term for any program people can use to request a web page. Browsers are the most commonly used user agents; a grabber is just a different type of user agent.
I'd consider blocking grabbers from some sites as they can use a lot of bandwidth indiscriminately and they help people copy all your content easily. Of course it depends on what your site is for: there are some kinds of sites (e.g. reference sites for hobbyists) that you might WANT people to be able to copy to their local machines.
Thanks for the info. I'm fairly new to the game and have only started looking hard at my stats. If I want to block one bad browser listed as a grabber, can I do it in a robots.txt file or should I specify it in an .htaccess file?
.htaccess is the method. Bad bots tend to ignore your robots.txt.
|If I want to block one bad browser listed as a grabber, can I do it in a robots.txt file or should I specify it in an .htaccess file? |
Before you start losing sleep, bear in mind that the "user-agent" or identifying string can usually be changed. For example, our own robot, which we use to locate useful web sites, declares itself as what it is, but it could just as easily claim to be Internet Explorer, or Opera, or whatever. The same is true of commercial web site copying software: these programs usually have built-in methods for bypassing your attempts to block them. I'm not saying this is good, just that there's not much you can do about it.
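To illustrate how easily the identifying string can be changed, here is a minimal Python sketch using the standard library's urllib.request. It only builds request objects and never contacts a server. The URL and both agent strings are made-up examples, not anything taken from the posts above.

```python
from urllib.request import Request

# Hypothetical address, used only for illustration; no request is sent.
URL = "http://www.example.com/"

def build_request(url, agent):
    """Return a request object that claims to come from `agent`."""
    return Request(url, headers={"User-Agent": agent})

# The same program can describe itself honestly or pretend to be a browser.
honest = build_request(URL, "ExampleGrabber/1.0")
disguised = build_request(URL, "Mozilla/5.0 (compatible; ExampleBrowser)")

# urllib stores header names in capitalized form ("User-agent").
print(honest.get_header("User-agent"))
print(disguised.get_header("User-agent"))
```

The server only ever sees the string the client chooses to send, which is why a block based on that string alone can be sidestepped.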
|GNU Wget is a free software package for retrieving files using HTTP, HTTPS and FTP, the most widely-used Internet protocols. It is a non-interactive commandline tool, so it may easily be called from scripts, cron jobs, terminals without X-Windows support, etc. |
It is an incredibly useful tool, but if it is consistently being (mis)used by a third party to download your site then you should attempt to block it. In spidering mode it obeys robots.txt (User-agent: Wget) unless the user explicitly disables the option. The user can also change the user agent identification string, so a really determined person can still rip your site with Wget even if you block the string "wget" in .htaccess. In these cases, you have to monitor IP addresses too.
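One rough way to monitor IP addresses is to count requests per address in the access log and look at the heaviest hitters, whatever user agent they claim. The Python sketch below assumes each log line starts with the client IP (as in Apache's common and combined log formats). The sample lines, the addresses, and the threshold are invented for illustration.

```python
from collections import Counter

def hits_per_ip(lines):
    """Count requests per client, assuming the IP is the first field."""
    counts = Counter()
    for line in lines:
        fields = line.split()
        if fields:  # skip blank lines
            counts[fields[0]] += 1
    return counts

def heavy_hitters(lines, threshold):
    """Return the addresses that made at least `threshold` requests."""
    return {ip: n for ip, n in hits_per_ip(lines).items() if n >= threshold}

# Made-up sample data; a real log would be read from a file.
sample_log = [
    '192.0.2.10 - - [01/Jan/2024:10:00:00 +0000] "GET /a.html HTTP/1.1" 200 512 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [01/Jan/2024:10:00:01 +0000] "GET /b.html HTTP/1.1" 200 734 "-" "Mozilla/5.0"',
    '192.0.2.10 - - [01/Jan/2024:10:00:02 +0000] "GET /c.html HTTP/1.1" 200 901 "-" "Mozilla/5.0"',
    '198.51.100.7 - - [01/Jan/2024:10:05:00 +0000] "GET /a.html HTTP/1.1" 200 512 "-" "Opera/9.0"',
]

print(heavy_hitters(sample_log, threshold=3))
```

An address that requests many pages in quick succession may be a grabber hiding behind a browser name, though it could also be a proxy or a search engine, so it is worth checking before blocking.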
Thanks (guys), good info.
I just noticed Wget in my stats, even though it's always been blocked in my robots.txt files. It says it has made about 30 hits this month, but then it says "0%" in the column marked "percent". Does this mean it's hitting my robots.txt file and then getting blocked? Or is it getting through, and is there more I should do to stop it?
Wget is written to adhere to robots.txt. Unfortunately there are variations and workarounds that let it download whole subdirectories and ignore the robots.txt instructions.
If it's pulling 0% down then you are ok.
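For what a robots.txt check looks like from the client's side, here is a small Python sketch using the standard library's urllib.robotparser. It shows how a compliant client reads the rules, not how Wget itself is implemented. The robots.txt content, the path, and the version numbers in the agent strings are assumptions made for the example.

```python
from urllib.robotparser import RobotFileParser

# Assumed robots.txt content that shuts out Wget only.
ROBOTS_TXT = """\
User-agent: Wget
Disallow: /
"""

def allowed(agent, path, robots_text=ROBOTS_TXT):
    """Return True if a compliant client named `agent` may fetch `path`."""
    parser = RobotFileParser()
    parser.parse(robots_text.splitlines())
    return parser.can_fetch(agent, path)

print(allowed("Wget/1.21", "/index.html"))    # agent named in the rules
print(allowed("Mozilla/5.0", "/index.html"))  # agent not named in the rules
```

The check is voluntary: it only has an effect if the client bothers to run it, which is why a client that skips it or renames itself still gets through.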
Hmm. On a couple of my other sites, it's pulling 2%, 9%... anything I can do about that? Beyond having it in my robots.txt file, obviously.
If you have access to your .htaccess file (and have Apache) then you could try:
SetEnvIfNoCase User-Agent "Wget" bad_bot
Deny from env=bad_bot
but do some reading on the Apache forum first, in case I have the User agent wrong or have missed some syntax.