Keeping out Webreaper

Is this the best way?

         

rover

5:06 pm on Jul 7, 2004 (gmt 0)

10+ Year Member



Hi,

We keep getting visited by the webreaper spider, and I'm looking for a way to keep it out. I found the following possible solution on a web page that suggests an .htaccess file in the default directory:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^Webreaper
RewriteRule ^.*$ /lists/ [F,L]

Does anyone know if this is the best way to accomplish this? Or is there a better solution? I just want to make sure that I don't inadvertently keep out the search engine spiders I want to visit, such as Googlebot, Yahoo, etc.

Also, if I already have an .htaccess file that already has an existing condition:

RewriteEngine on
RewriteRule ^pagelink/(.+)/ /cgi-local/runner.cgi?p=linker&ID=$1 [L]

Would I add the new rule for keeping out Webreaper below it in its entirety, starting again with the line

RewriteEngine On

or should I not repeat this line because it was turned on with the first condition?

Thanks in advance for any help/suggestions.

jdMorgan

5:29 pm on Jul 7, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The RewriteRule you found is malformed: the "/lists/" field is extraneous. With the [F] flag no substitution takes place, so that field should be a hyphen ("-").

Try something like this:


RewriteCond %{HTTP_USER_AGENT} ^Webreaper [NC]
RewriteRule .* - [F]

Note that if you use a custom 403 error document, you will probably want to exclude it from the rewrite in order to prevent a loop:

RewriteCond %{HTTP_USER_AGENT} ^Webreaper [NC]
RewriteRule !^path_to_custom_error_document$ - [F]

I added the [NC] (No case) flag because there seems to be some disagreement about how that user-agent name is capitalized.

You do not need to repeat the RewriteEngine on directive within any given .htaccess file.
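As a rough sketch, assuming the existing pagelink rule from the first post, the combined .htaccess file might look like the following. Placing the block ahead of the pagelink rule is an assumption here, made so that unwanted requests are refused before any other rewriting takes place:

```apache
# One RewriteEngine directive is enough for the whole file
RewriteEngine on

# Refuse any request whose User-Agent starts with "Webreaper" (any case)
RewriteCond %{HTTP_USER_AGENT} ^Webreaper [NC]
RewriteRule .* - [F]

# Existing rule: map pagelink/<id>/ to the CGI script
RewriteRule ^pagelink/(.+)/ /cgi-local/runner.cgi?p=linker&ID=$1 [L]
```

A RewriteCond applies only to the RewriteRule that immediately follows it, so requests from Googlebot and other user-agents that do not match the condition pass through the first rule untouched.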

Refs:
Apache mod_rewrite documentation [httpd.apache.org]
Apache URL Rewriting Guide [httpd.apache.org]
Regular Expressions Tutorial [etext.lib.virginia.edu]

A Close to perfect .htaccess ban list [webmasterworld.com] (In three parts)

Jim

rover

5:44 pm on Jul 7, 2004 (gmt 0)

10+ Year Member



Thanks that's very helpful.

RewriteRule !^path_to_custom_error_document$ - [F]

If the .htaccess file is in the root directory, then when I write the path to the custom error document I shouldn't start with a slash, is that right? i.e.

RewriteRule !^directory_name/error_document.html$ - [F]

rather than:

RewriteRule !^/directory_name/error_document.html$ - [F]

jdMorgan

5:50 pm on Jul 7, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



This is actually just an anomaly between RewriteRule patterns in .htaccess and those in httpd.conf. For use in .htaccess, omit the leading slash as shown. For use in httpd.conf, include the leading slash in RewriteRule patterns. Always include the leading slash in URL-path patterns for RewriteCond %{REQUEST_URI}.

You can also use the RewriteBase directive to avoid this inconsistency if you like.
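To illustrate the difference, here is a sketch of the same exclusion written three ways. These are alternatives, not directives to be used together, and directory_name/error_document.html is the placeholder path from the question above:

```apache
# Alternative 1: in .htaccess at the document root (per-directory context).
# The pattern is matched against the path with the leading slash removed.
RewriteCond %{HTTP_USER_AGENT} ^Webreaper [NC]
RewriteRule !^directory_name/error_document\.html$ - [F]

# Alternative 2: in httpd.conf (server or virtual-host context).
# The pattern is matched against the full URL-path, leading slash included.
RewriteCond %{HTTP_USER_AGENT} ^Webreaper [NC]
RewriteRule !^/directory_name/error_document\.html$ - [F]

# Alternative 3: test REQUEST_URI instead, which starts with a slash
# regardless of where the rule is placed.
RewriteCond %{HTTP_USER_AGENT} ^Webreaper [NC]
RewriteCond %{REQUEST_URI} !^/directory_name/error_document\.html$
RewriteRule .* - [F]
```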

Jim

Bridge

10:36 pm on Jul 22, 2004 (gmt 0)

10+ Year Member



I found this really interesting.

I have several rewrites in place such as:

RewriteCond %{HTTP_REFERER} ^http://www.example.com/* [OR]
RewriteCond %{REQUEST_URI} FormMail.*
RewriteRule ^.* - [F,L]

I have a long list of them. I also have a custom error document in place, set up with these lines in the .htaccess file:

ErrorDocument 400 /errors/phpErrorDoc.php?400
ErrorDocument 401 /errors/phpErrorDoc.php?401
ErrorDocument 403 /errors/phpErrorDoc.php?403
ErrorDocument 404 /errors/phpErrorDoc.php?404
ErrorDocument 500 /errors/phpErrorDoc.php?500

After reading your post, I'm wondering if I need to add something to prevent loops. I don't even know if I have loops, although I have had to add a script which checks the server load and restarts the server when the load goes above 5, which can happen once or twice a day.

Anyone?

[edited by: jdMorgan at 2:08 pm (utc) on July 23, 2004]
[edit reason] examplified URL per TOS [/edit]

fiestagirl

10:49 pm on Jul 22, 2004 (gmt 0)

10+ Year Member



At robotstxt.org the webreaper page suggests that webreaper follows the exclusion protocol. Have you found this to be untrue?

jdMorgan

12:22 am on Jul 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Bridge,

Simply change your rule to:


RewriteRule !^errors/phpErrorDoc\.php$ - [F]

The [L] flag, when used with [F], [G], or [P], is redundant.

Jim

Bridge

9:53 am on Jul 23, 2004 (gmt 0)

10+ Year Member



Thanks a great deal for this.

Now, just so I understand this, I'm a newbie to this.

What is actually happening now?

Could you map that out?

RewriteCond %{HTTP_REFERER} ^http://www.example.com/* [OR]
RewriteCond %{REQUEST_URI} FormMail.*
RewriteRule !^errors/phpErrorDoc\.php$ - [F]

[edited by: jdMorgan at 2:10 pm (utc) on July 23, 2004]
[edit reason] examplified URL per TOS [/edit]

jdMorgan

2:13 pm on Jul 23, 2004 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member




RewriteCond %{HTTP_REFERER} ^http://www\.example\.com [OR]
RewriteCond %{REQUEST_URI} FormMail
RewriteRule !^errors/phpErrorDoc\.php$ - [F]

(IF the referrer starts with "http://www.example.com" OR the requested URI contains "FormMail"),
AND IF the requested local URL-path is NOT "errors/phpErrorDoc.php",
THEN leave the URL unchanged, but return a 403-Forbidden server response to the requestor.

(Also, corrected the special character escaping in the referrer and removed a bit of unnecessary fluff from the regex code).
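For reference, the regex changes can be annotated roughly as follows. The comments are an interpretation of the corrections described above, not part of the original rule set:

```apache
# Original (commented out): each unescaped "." matches any single character,
# and "/*" means "zero or more slashes", so the tail adds nothing to the match.
# RewriteCond %{HTTP_REFERER} ^http://www.example.com/* [OR]

# Corrected: "\." matches only a literal dot.
RewriteCond %{HTTP_REFERER} ^http://www\.example\.com [OR]

# An unanchored pattern already matches anywhere in the string,
# so "FormMail.*" and "FormMail" select the same requests.
RewriteCond %{REQUEST_URI} FormMail

# Exclude the custom error document so the 403 response can be served.
RewriteRule !^errors/phpErrorDoc\.php$ - [F]
```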

Jim

Bridge

2:22 pm on Jul 23, 2004 (gmt 0)

10+ Year Member



Many thanks.