Forum Moderators: phranque

Help with site rippers

HTTrack off-line browser, site ripping, htaccess


aperturedelight

7:47 am on Jul 21, 2009 (gmt 0)

10+ Year Member



Hi,

I am wondering if anyone can help me with how to stop HTTrack.

I have tried putting this in htaccess:

RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^httrack.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^httrack* [OR]
RewriteRule ^.* - [F]

It isn't working and so it's costing me around 5GB a day in bandwidth!

Any help would be very much appreciated.

Thanks
Steve

[edited by: jdMorgan at 2:44 pm (utc) on July 21, 2009]
[edit reason] No URLs, please. [/edit]

vincevincevince

7:51 am on Jul 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Options +FollowSymLinks
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteRule ^.* - [F]

aperturedelight

8:04 am on Jul 21, 2009 (gmt 0)

10+ Year Member



thanks vince.

this htaccess stuff bends my mind!

Steve

jdMorgan

2:44 pm on Jul 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In a 'list' of [OR]ed RewriteConds, the last RewriteCond must not have an [OR] flag (you cannot [OR] a RewriteCond with a RewriteRule).

So the single-RewriteCond code above likely won't work as expected unless the [OR] is removed.
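For illustration (the second user-agent below is just a placeholder, not from this thread), a correctly [OR]ed chain puts the flag on every RewriteCond except the last:

Options +FollowSymLinks
RewriteEngine on
#
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SomeOtherRipper [NC]
RewriteRule ^ - [F]

With only a single RewriteCond, simply drop the [OR] flag.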

Jim

aperturedelight

3:24 pm on Jul 21, 2009 (gmt 0)

10+ Year Member



Thanks. I have changed my htaccess file now and will check the logs tomorrow to see if it has stopped them doing this. Hopefully it will have.

Steve

wilderness

3:40 pm on Jul 21, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I have changed my htaccess file now and will check the logs tomorrow to see if it has stopped them doing this. Hopefully it will have.

Just a word of caution!

After making changes to .htaccess, you should ALWAYS and IMMEDIATELY check your site(s) to make sure they function and do not return a 500 error (site not working) due to a syntax error.

aperturedelight

6:34 am on Jul 24, 2009 (gmt 0)

10+ Year Member



Hi Guys,

I have been monitoring my logs and this hasn't stopped HTTrack from downloading my site! :(

They were back again last night..
"HTTrack off-line browser 7.03 GB 23 Jul 2009 - 22:35"

Does anyone have any other ideas or suggestions on how I can stop this? I was wondering about bandwidth limiting? Is that possible?

Thanks In Advance
Steve

jdMorgan

1:14 pm on Jul 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



You will still see the HTTrack requests in your access log file, but you should see a 403-Forbidden response instead of a 200-OK or 304-Not Modified response.

Before getting too excited, let's be sure the code is correct and that mod_rewrite is actually enabled. I noted a missing directive in the code above, so I'd suggest:


Options +FollowSymLinks
RewriteEngine on
#
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC]
RewriteRule ^ - [F]

After installing that, use "Live HTTP Headers" plus "User-Agent Switcher," "Mozilla PrefBar," or a similar Firefox/Mozilla add-on to switch your user-agent to HTTrack and test this rule. You should see a 403-Forbidden response.

If you don't want to install these add-ons, then you could use an on-line user-agent spoofer like WannaBrowser, although results with such on-line tools are sometimes inconsistent (especially when redirects are involved).
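If you have shell access, curl offers another quick check; this is just a sketch, with example.com standing in for your own domain:

curl -I -A "HTTrack" http://example.com/

The -A option spoofs the User-Agent header and -I requests only the response headers, so the status line should read "403 Forbidden" once the rule is working.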

Also note that if you use a custom 403 error page, then the URL-path of that page will need to be excluded from the rule above. Otherwise, you'll get a 403 error loop.

Jim

aperturedelight

7:23 pm on Jul 24, 2009 (gmt 0)

10+ Year Member



Hi Jim,

Thanks, this htaccess stuff really is beyond my understanding! I have followed your advice and can see the access to HTTrack being blocked now so that's reassuring. :) Thank you.

As you have noted, I do have a 403 custom error page loop, but I'm not sure how to deal with it: my error pages are served via server configuration administered by my server host, i.e. the URL doesn't actually change when the error page is displayed.

I think I would prefer to just redirect these rippers back to Google if I knew how to!

Thanks so much for your help.

Steve

jdMorgan

8:23 pm on Jul 24, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Rippers don't follow redirects, and if they did, you'd just be passing the problem along and wasting even more internet bandwidth, so just 403 them and be done with it.

In order to cure the 403 loop problem, you'll need to find the local path of the custom 403 page. Or perhaps you might want to replace it with your own custom 403 page by using the ErrorDocument directive, in which case you can define the path yourself.
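The ErrorDocument line itself is a single directive; assuming a hypothetical page at /errors/403.html on your own server, it would be:

ErrorDocument 403 /errors/403.html

That same URL-path (/errors/403.html in this example) is then the one to exclude in the RewriteCond so the blocked client can still fetch the error page.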

At any rate, symbolically, the cure for the loop (and another problem) is:


Options +FollowSymLinks
RewriteEngine on
#
RewriteCond %{REQUEST_URI} !^/(robots\.txt|<path-to-custom-403-page\.html>)$
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC]
RewriteRule ^ - [F]

Important: Replace <path-to-custom-403-page.html> with the actual URL-path of your custom 403 error page before use.

For more information about mod_rewrite and regular expressions patterns, see the resources cited in our Apache Forum Charter (link at top of this page).

Jim