Apache Web Server Forum

    
HTTrack & WinHTTrack
yaashul

5+ Year Member



 
Msg#: 4577698 posted 9:12 am on May 25, 2013 (gmt 0)

Someone is scraping my website completely. I searched a bit and found he might be using the tools HTTrack or WinHTTrack.

What I did was block his IP range and this user agent via .htaccess:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} httrack [NC]
RewriteRule !^robots\.txt$ - [F]

But the new HTTrack software supports changing the user agent and surfing through a proxy. How can I protect my site from these tools?

 

lucy24

WebmasterWorld Senior Member



 
Msg#: 4577698 posted 6:00 pm on May 25, 2013 (gmt 0)

Blocking the IP is probably a waste of time, unless it's a career scraper. Usually it's just some dimwit who doesn't know what they're doing. Or, at worst, a one-time scraper who will never set foot on your site again.

Blocking the UA is definitely a good idea. In fact HTTrack is on many people's UA-block list. Personally I do this kind of thing in mod_setenvif in the form

BrowserMatchNoCase HTTrack keep_out

followed with

Deny from env=keep_out

with a <Files "robots.txt"> envelope to override the block.
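
Put together, a minimal sketch of that arrangement (using the old Order/Allow/Deny access syntax that comes up later in this thread; the keep_out variable name is arbitrary) would be:

# Flag the scraper by user-agent; "HTTrack" also matches inside "WinHTTrack"
BrowserMatchNoCase HTTrack keep_out

Order Allow,Deny
Allow from all
Deny from env=keep_out

# The <Files> envelope lets even flagged agents fetch robots.txt
<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>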

But the new HTTrack software supports changing the user agent and surfing through a proxy. How can I protect my site from these tools?

Scraping tools also give lip service to robots.txt-- and then give users the option of ignoring robots.txt.

Some people block all proxies. You can lock out the known proxies by IP. Many proxies also send the X-Forwarded-For header, which you can block on independently:
SetEnvIf X-Forwarded-For . keep_out
meaning "if this header exists at all..." (Careful! Google Preview also sends this header, though That Other Search Engine doesn't. You may or may not consider this an asset.)

yaashul

5+ Year Member



 
Msg#: 4577698 posted 7:03 pm on May 25, 2013 (gmt 0)

If I consider this option of keeping these proxies out, how much of a negative effect will blocking Google Preview have on my site's PageRank and other SEO-related things?

yaashul

5+ Year Member



 
Msg#: 4577698 posted 3:00 am on May 26, 2013 (gmt 0)

Lucy24,

I tried searching for the solution you gave, and I found how to block using BrowserMatchNoCase:

BrowserMatchNoCase "^Lynx" banned=1
Order Deny,Allow
Deny from env=banned
Allow from env=permitted

But I am not able to allow robots.txt. Can you give me the solution for that?

lucy24

WebmasterWorld Senior Member



 
Msg#: 4577698 posted 4:53 am on May 26, 2013 (gmt 0)

But I am not able to allow robots.txt

You just need a <Files> envelope as I said above. (Disclaimer: "Envelope" is not the proper technical term. If there is one, I've got a mental block on it.) Mine goes

<Files "robots.txt">
Order Allow,Deny
Allow from all
</Files>

Anything inside a <Files> or <FilesMatch> envelope will override anything outside the envelope, in the same way that a deeper htaccess will generally override a higher/earlier one. Here, the <Files> envelope overrides any Allow/Deny directives that apply generically to all files.

Note that this override only works within the same module. If you've locked someone out via mod_rewrite, an "Allow from all" directive in Files or FilesMatch will not let them in. Unless you have nested Rewrites,* which are not for the faint of heart. This need not be an issue, because you can constrain RewriteRules to specific extensions as needed.
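
For instance, a sketch of a rule constrained to page extensions (swap in whatever extensions your site actually serves):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} httrack [NC]
# Forbid only requests for pages; robots.txt and error pages are left alone
RewriteRule \.(html|php)$ - [F]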

Are you sure you want ^Lynx with opening anchor? Then you're only blocking visitors with "Lynx" at the very beginning of their UA string.

:: shuffling papers ::

It does seem to come first in the few I checked-- including myself on the test site. If "Lynx" --or any other element you're testing for-- normally comes at the beginning of the UA string, then do keep the anchor. Saves work for the server. (This is a pretty general principle with Regular Expressions.)
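
In mod_setenvif terms, the difference looks like this:

# Anchored: only matches user-agents that begin with "Lynx"
BrowserMatchNoCase ^Lynx keep_out
# Unanchored: matches "Lynx" anywhere in the user-agent string
BrowserMatchNoCase Lynx keep_out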

Most of the time, Allow/Deny directives will say either
Allow from all
Deny from ... {itemize these}
or
Deny from all
Allow from ... {itemize again}
depending on whether you're blacklisting or whitelisting.
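
Spelled out with made-up example addresses -- and remember that only one Order directive applies in a given scope, so pick one pattern or the other:

# Blacklisting: let everyone in except the itemized offenders
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from env=keep_out

# Whitelisting: keep everyone out except the itemized addresses
Order Deny,Allow
Deny from all
Allow from 203.0.113.42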

Yes, BrowserMatch(NoCase) is a good shortcut in mod_setenvif.

:: looking vaguely around for someone else to weigh in on SEO (dis)advantages of Google Preview though it's possible there is no Final Word yet since it's only been around for a year or two ::


* I've got a fistful of them-- but only because nobody told me you're not supposed to.

yaashul

5+ Year Member



 
Msg#: 4577698 posted 5:00 am on May 26, 2013 (gmt 0)

lucy24,

A few more things I need to ask. How do I handle a space inside a user agent? For example, if I want to block (Web Image Collector) -- without the brackets -- how do I do that? Will just inverted commas work?

Do I need to mention the ^ sign?

Is there any way to present an actual 403 page, rather than the apache configuration page, to the bots that are blocked?

yaashul

5+ Year Member



 
Msg#: 4577698 posted 5:16 am on May 26, 2013 (gmt 0)

One more thing: how do I block a blank user agent using this method?

lucy24

WebmasterWorld Senior Member



 
Msg#: 4577698 posted 6:02 am on May 26, 2013 (gmt 0)

In mod_setenvif, you can put text inside quotation marks to protect literal spaces. (In some modules, you have to \-escape the space instead.) As in:

BrowserMatch ^-?$ keep_out
BrowserMatch Ahrefs keep_out
BrowserMatch "America Online Browser" keep_out

et cetera.

The question mark in the first line is for insurance. Any blank shows up in logs as a single - (because the log has to put something there) so to cover myself I let Apache decide for itself if it wants to call a blank user-agent ^$ or ^-$

Do I need to mention the ^ sign?

That's an opening anchor. See earlier post about Lynx. If you're certain that something will always come at the very beginning of your test string, use an anchor. Then the server can stop looking right away, and be out of there all the sooner. But if the text might come later on, leave off the anchor.

Closing anchors aren't as common in mod_setenvif, except in cases like the empty user-agent where you want to say "this is the entire string from beginning to end".

Is there any way to present an actual 403 page, rather than the apache configuration page, to the bots that are blocked?

By "apache configuration page" do you mean some type of apache error message?

Crystal ball says you have a custom 403 page. If so, here is what happens:

bad robot asks for page
htaccess says "Nuh-uh, you can't have it"
server asks for 403 page instead
htaccess says "No, that's a bad robot, it isn't allowed to see any pages"
server says "Oh, in that case let me show it the 403 page instead"
htaccess says "I can't let the bad robot see any pages"
server says "I give up!" and shows the Apache default page instead.

So you need another envelope, like this:

<FilesMatch "(forbidden|goaway|missing|sorry)\.html$">
Order Allow,Deny
Allow from all
</FilesMatch>

That's my own version. It's actually even longer; I just showed some of my error pages here. If you only have one error page, you can do it with <Files> instead. It's the same principle as robots.txt: you need to allow everyone to see the page.

yaashul

5+ Year Member



 
Msg#: 4577698 posted 6:09 am on May 26, 2013 (gmt 0)

Lucy24,

Then I have to create those files:
forbidden.html
goaway.html
missing.html
sorry.html

and how do I link them with their HTTP error codes?

yaashul

5+ Year Member



 
Msg#: 4577698 posted 6:27 am on May 26, 2013 (gmt 0)

Currently I use this method.

ErrorDocument 400 /error.php
ErrorDocument 401 /error.php
ErrorDocument 403 /error.php
ErrorDocument 404 /error.php
ErrorDocument 500 /error.php


and the error.php file looks like this:


<?php

$page_redirected_from = $_SERVER['REQUEST_URI']; // this is especially useful with error 404 to indicate the missing page.
$server_url = "http://" . $_SERVER["SERVER_NAME"] . "/";
$redirect_url = $_SERVER["REDIRECT_URL"];
$redirect_url_array = parse_url($redirect_url);
$end_of_path = strrchr($redirect_url_array["path"], "/");

switch(getenv("REDIRECT_STATUS"))
{
# "400 - Bad Request"
case 400:
$error_code = "400 - Bad Request";
$explanation = "The syntax of the URL submitted by your browser could not be understood. Please verify the address and try again.";
$redirect_to = "";
break;

# "401 - Unauthorized"
case 401:
$error_code = "401 - Unauthorized";
$explanation = "This section requires a password or is otherwise protected. If you feel you have reached this page in error, please return to the login page and try again, or contact the webmaster if you continue to have problems.";
$redirect_to = "";
break;

# "403 - Forbidden"
case 403:
$error_code = "403 - Forbidden";
$explanation = "This section requires a password or is otherwise protected. If you feel you have reached this page in error, please return to the login page and try again, or contact the webmaster if you continue to have problems.";
$redirect_to = "";
break;

# "404 - Not Found"
case 404:
$error_code = "404 - Not Found";
$explanation = "The requested resource '" . $page_redirected_from . "' could not be found on this server. Please verify the address and try again.";
$redirect_to = "";
break;

# "500 - Internal Server Error"
case 500:
$error_code = "500 - Internal Server Error";
$explanation = "The server experienced an unexpected error. Please verify the address and try again.";
$redirect_to = "";
break;
}
?>

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Transitional//EN"
"http://www.w3.org/TR/xhtml1/DTD/xhtml1-transitional.dtd">
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<link rel="Shortcut Icon" href="/favicon.ico" type="image/x-icon" />
<title><?php print ($error_code); ?></title>
</head>
<body>

<h1>Error Code <?php print ($error_code); ?></h1>

<p>The <a href="<?php print ($page_redirected_from); ?>">URL</a> you requested was not found. <?PHP echo($explanation); ?></p>

<p>You may also want to try starting from the home page: <a href="<?php print ($server_url); ?>"><?php print ($server_url); ?></a></p>

<hr />

<p><i>A project of <a href="<?php print ($server_url); ?>"><?php print ($server_url); ?></a>.</i></p>

</body>
</html>


This code is from the DreamHost wiki page.

lucy24

WebmasterWorld Senior Member



 
Msg#: 4577698 posted 7:23 am on May 26, 2013 (gmt 0)

Then I have to create those files

No, no, I'm just quoting my own htaccess as an example. Substitute the name of your own custom 403 page-- "error.php" if that's what it is.

But why don't you just use static html pages? Even if the error originated in php -- as with a dynamic page that has to figure out if its parameters are valid before knowing if it can build the html -- you can still simply "include" your ordinary error page.

If there's already an error condition, seems like the last thing you want to do is put the server to even more work. And you surely don't want to waste time building pages for a robot that will not even stick around to read them :)
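
A sketch of the static version, reusing the example file names from the FilesMatch block above (substitute your own):

# Map the error codes to flat HTML pages instead of error.php
ErrorDocument 403 /forbidden.html
ErrorDocument 404 /missing.html

# As with robots.txt, blocked visitors must still be allowed to fetch the 403 page
<FilesMatch "(forbidden|missing)\.html$">
Order Allow,Deny
Allow from all
</FilesMatch>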

yaashul

5+ Year Member



 
Msg#: 4577698 posted 7:32 am on May 26, 2013 (gmt 0)

lucy24,

Thanks a lot for solving my problems :)
