I wanted to use a stronger word, but there are grownups reading these forums.
HTTrack shows up a good bit in WebmasterWorld, but usually not in the UA-identification context.
UA (details probably irrelevant): Mozilla/4.5* (compatible; HTTrack 3.0x; Windows 98)
IP (almost certainly irrelevant): 176.74.192.nn (in Sweden, of all places)
Files: 185 in a bit under 2 minutes, including 46 html files repeated with HTTP 1.0 instead of HTTP 1.1. In spurts and hiccups, so generally 4-5/second. Only one from a roboted-out directory (the others require at least four recursions of links, and they stopped at two).
Referer: generally my front page, even when picking up interior files
robots.txt: Don't be silly.
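For anyone who wants to spot this UA in their own logs, here is a minimal Python sketch. The pattern and the function name httrack_version are my own invention for illustration, not anything HTTrack documents, and it only catches visitors who leave the default UA in place.

```python
import re

# Hypothetical pattern (an assumption, not an official format):
# the "HTTrack" token, optionally followed by a version that starts with a digit.
HTTRACK_RE = re.compile(r"\bHTTrack\b(?:\s+(\d[\w.]*))?", re.IGNORECASE)


def httrack_version(user_agent):
    """Return the HTTrack version found in the UA string.

    Returns "" if the token is present without a version,
    and None if the token is absent altogether.
    """
    match = HTTRACK_RE.search(user_agent)
    if match is None:
        return None
    return match.group(1) or ""


# The UA from this post, minus the footnote asterisk.
ua = "Mozilla/4.5 (compatible; HTTrack 3.0x; Windows 98)"
print(httrack_version(ua))
```

A None result proves nothing, of course: anyone who changes the UA string sails straight past a check like this.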
They have a www site [httrack.com] that answers all your questions, starting with the basic:
Q: Some sites are captured very well, other aren't. Why?
A: There are several reasons (and solutions) for a mirror to fail. Reading the log files (ans [sic] this FAQ!) is generally a VERY good idea to figure out what occured.
* Website 'robots.txt' rules forbide [sic again] access to several website parts - you can disable them, but only with great care!
* HTTrack is filtered (by its default User-agent IDentity [sic no. 3]) - you can change the Browser User-Agent identity to an anonymous one (MSIE, Netscape..) - here again, use this option with care, as this measure might have been put to avoid some bandwidth abuse (see also the abuse faq!)
If you want to hide, give your UA as Netscape. Nobody will ever notice you then.
Wait, don't rush off to have your jaw wired back into place just yet. The Abuse FAQ [httrack.com] is even better. Very educational.
I guess I must be getting bigger; I had never met these guys before yesterday.
Don't know about the rest of y'all. But as far as I'm concerned, if someone chooses to copy my site for personal offline reading (the official reason for using this program), they can bloody well do it the way I do. Manually load up the desired pages, save them with your browser... and then sneer at the HTML before quietly cleaning it up.
* I almost missed this detail. Mozilla what