Forum Moderators: DixonJones

Message Too Old, No Replies

HTTrack sitegrabber

grabbed my whole site, including dynamic URLs


ppg

12:28 pm on Sep 17, 2002 (gmt 0)

10+ Year Member



I don't know if this is a problem, but I found in my logs that someone (or something) had been using HTTrack to grab my entire site.

A reverse DNS lookup revealed the following host: cache-ink2-cro-hsi.cableinet.co.uk.

Cableinet is the old name for Blueyonder, a UK cable ISP (i.e. Telewest).

Anyone got any idea what's going on here? Would you advise me to restrict this kind of activity?

It did request my robots.txt.
Thanks, Paul

BlobFisk

1:11 pm on Sep 17, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As far as I can remember, this is a piece of software that allows a user to rip the entire contents of a website.

It works the same way as IE's File > Save As, but quicker, insofar as it downloads all .html, .js, .gif, .jpg etc. files (based on the options set) from your site.

I don't know of any way of stopping it, as it uses the standard HTTP port 80. The program works by examining the source code of the page and then requesting and downloading all assets. It then follows all links and does the same on those pages. I'm not sure what it does with server-side pages though (.asp, .jsp, .php etc.).

4eyes

1:13 pm on Sep 17, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



HTTrack is a PC based site grabber.

Someone with a Blueyonder account has used it to take a copy of your entire site.

This might be because they like it so much and want to browse it offline. Or it might be because they intend to clone it in some way.

Either way, it's not nice.

If I remember correctly, HTTrack can be configured to ignore the robots.txt, but you should be able to block its user agent, which at least would make it more difficult for them.
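Something along these lines in an .htaccess file should do it, assuming your server is Apache with mod_setenvif enabled (an untested sketch - and bear in mind the user agent string is trivial for the downloader to change, so this only stops default installs):

```apache
# Flag any request whose User-Agent contains "HTTrack"
# (case-insensitive), then deny flagged requests.
SetEnvIfNoCase User-Agent "HTTrack" bad_bot

<Limit GET POST>
    Order Allow,Deny
    Allow from all
    Deny from env=bad_bot
</Limit>
```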

Mikkel Svendsen

1:34 pm on Sep 17, 2002 (gmt 0)

10+ Year Member



I would say it depends on what kind of site you have. One of the sites I run is a free information site - the goal is to have people read my articles, so for that site I do not care if they download it for off-line browsing.

However, if your site makes a living from advertising you may not like it, or if your site serves dynamic information that changes a lot (like a shop). There are probably a bunch of other reasons not to allow offline browsing - but my point was just that it is not always bad. It depends on the purpose of the site :)

ppg

1:53 pm on Sep 17, 2002 (gmt 0)

10+ Year Member



Hm. I doubt the site is interesting enough for someone to want it offline, so I guess there's something nefarious going on. The biggest part of it is the Cisco global price list served up by JSPs (which HTTrack can grab, as I just found out). I suppose this information may be of interest to someone, but then why grab the whole site? Why not just this section?

If HTTrack can be configured to ignore robots.txt, I guess the best way for me to ban it is with .htaccess?
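For what it's worth, a mod_rewrite-based version is another option if that module is enabled on your host - again just a sketch based on the default HTTrack user agent string, which a determined user can spoof:

```apache
# Return 403 Forbidden to any request whose user agent
# contains "httrack" (the [NC] flag makes it case-insensitive).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} httrack [NC]
RewriteRule .* - [F]
```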

mikie

3:56 pm on Sep 21, 2002 (gmt 0)

10+ Year Member



I don't know much about HTTrack, but I have a web site downloader program that is similar to it. I use it for legitimate reasons, and with permission. I have found that anything under https:// is blocked. Some sites redirect from http:// to https:// to block programs such as HTTrack.
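If anyone wants to try that approach, the redirect itself is simple enough in .htaccess, assuming Apache with mod_rewrite and an SSL-enabled host (untested sketch; older grabbers without SSL support will then fail, though newer ones may follow the redirect fine):

```apache
# Redirect all plain-HTTP requests to the https:// equivalent.
RewriteEngine On
RewriteCond %{HTTPS} !=on
RewriteRule ^(.*)$ https://%{HTTP_HOST}/$1 [R=301,L]
```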

dcheney

4:58 pm on Sep 21, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I also have a large non-commercial site and a number of folks have made copies of it. I don't mind as long as it's for non-commercial use - except for the idiots who set their grabber up for 200 pages/minute - which seems fairly common :(

chiyo

6:08 pm on Sep 21, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Why is it "not nice" to download a site for offline browsing? I do it all the time and I think I'm nice :)

At slow connection speeds and per-minute phone plus internet charges, there's no way I'm going to browse sites with the clock ticking away.

There is nothing wrong with downloading material from the web. Google does it all the time. It is how it is used where a problem may occur. If it's available on the net, people will view it (and download the whole lot) if they like it. As long as it's for personal use, no problem at all.

dcheney

7:09 pm on Sep 21, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The reason I don't like it is simple: when one of the 200 pages/minute idiots grabs my site, it is very hard for anyone else to use it for the next 3 hours until the idiot is done.

To me that is exactly the same as a DoS attack.

Mikkel Svendsen

8:22 pm on Sep 21, 2002 (gmt 0)

10+ Year Member



I agree that it looks like there is no reason for grabbing this - but if your webserver suffers from 200 page views a minute, you can't have many users :) - Sorry for teasing you, but I just never saw a webserver having problems with only 200 page views a minute unless the website was on an overloaded shared server.

dcheney

10:00 pm on Sep 21, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The server can handle it (seems to, anyway), but I would still consider that an attack. Just to make it clear - I'm not talking about this going on for 5 minutes - but for 3 hours or more.