Forum Moderators: phranque


Bad Robots

Help me keep this robot off our site!

         

mkhines

2:35 pm on Dec 20, 2006 (gmt 0)

10+ Year Member



I'm trying to block a robot that is scouring our site to the point where it
goes down. It is ignoring our robots.txt, of course.

in our logs it shows up as

101057-web2.gold.funnelback.com

and it goes over every link on every page. On one specific page it gets
caught in a cluster of keyword links, each of which triggers a query to our
database. Several times a day it takes our site down completely with 503 errors.

I've tried adding this code to httpd.conf (we're running Apache 2 on Windows
with Tomcat), but when I restart Apache, it refuses to start.

Is there a problem with the code below? The first line and the last two were
already there; the two commented-out lines are the new code that Apache
doesn't like and won't start with.

RewriteEngine on

#RewriteCond %{HTTP_USER_AGENT} ^*.funnelback.*$
#RewriteRule .* - [F,L]

RewriteCond %{REQUEST_METHOD} ^TRACE
RewriteRule .* - [F]

Please help!

Megan

jdMorgan

4:04 pm on Dec 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Megan,

Welcome to WebmasterWorld!

There are several problems here. The first is that you cannot "guess" mod_rewrite code. To do so is dangerous to the health of your server, as you've discovered with the failed restarts.

The specific problem is that the patterns used by mod_rewrite directives are written as regular expressions, a standard text-pattern-matching "language" used in PHP, Perl, and many other scripting languages. Regular-expression syntax in no way resembles DOS command-line patterns such as "*.*" meaning "all files." A concise regular-expressions tutorial is cited in our forum charter.

In regular expressions, ".*" (Note: Not "*.") means "match any number (including zero) of any characters."
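For instance, the same point can be sanity-checked with Python's re module, which uses the same regex family as mod_rewrite (the test strings here are just illustrations):

```python
import re

# ".*" means "any number (including zero) of any characters".
# An unanchored pattern only needs to appear somewhere in the string.
pattern = re.compile(r"funnelback")

print(bool(pattern.search("101057-web2.gold.funnelback.com")))    # True
print(bool(pattern.search("Mozilla/5.0 (compatible; Googlebot)")))  # False
```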

The next problem is that "funnelback" does not appear to be a user-agent. Instead, it looks like a hostname which your server has looked up using reverse DNS (RDNS). So the first step is to properly identify the user-agent (if it provides one when it accesses your site), and then figure out the correct regular expression to match that user-agent and, optionally, all of its potential versions and agent-name variants.

A minor point is that [L] used with [F] is redundant.

Note that the use of "^.*" and ".*$" in regular-expressions patterns is almost always unnecessary and wasteful of bytes on disk and CPU time. In order to understand this, you'll need to become familiar with the concept of "anchoring" in regular-expressions patterns. Patterns may be ^start-anchored, requiring the input string to start with a certain pattern. Or they may be end-anchored$, requiring the input string to end with a certain pattern. Building on that, you can use both a ^start and an end anchor$, which requires the input string to exactly match the pattern, or you may omit both anchors, which requires only that the input string contain text matching the pattern.
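The four anchoring cases above can be sketched quickly in Python's re module (the user-agent string is a hypothetical example, not the actual funnelback agent):

```python
import re

ua = "funnelback-crawler/1.0"  # hypothetical user-agent string

assert re.search(r"^funnelback", ua)                # start-anchored: input must begin with it
assert re.search(r"/1\.0$", ua)                     # end-anchored: input must end with it
assert re.search(r"^funnelback-crawler/1\.0$", ua)  # both anchors: exact match required
assert re.search(r"crawler", ua)                    # no anchors: substring anywhere suffices
print("all four anchoring cases match")
```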

So, job #1 here is to review your raw server access logs, using the time of the accesses by this agent to find the relevant entries, and identify the user-agent string if there is one. Then we can proceed to implementing a solution. If this intruder does not provide a user-agent string, then other options are available, such as blocking it by IP address or by IP address range.

For more information, see the documents cited in our forum charter [webmasterworld.com] and the tutorials in the Apache forum section of the WebmasterWorld library [webmasterworld.com].

Jim

mkhines

4:18 pm on Dec 20, 2006 (gmt 0)

10+ Year Member



Thank you, Jim, for your help. Looking at the server access and referrer logs, this is what appears:

101057-web2.gold.funnelback.com - - [20/Dec/2006:10:08:06 -0600] "GET /url on my site" 200 53213

The referrer entries are blank for each of its requests. I'm not sure exactly what that means. Are they purposely hiding this information so we can't block them?

Just to let you know, too: I am testing the rewrites on a development machine before putting them on the server. This was my last attempt after reading around on these forums, but they're still getting through.

ReWriteCond %{HTTP_REFERER} ^101057-web2.gold.funnelback.com*$
RewriteRule .* - [L]

This doesn't work either. After reading your message - and discovering that there is no referrer listed in the logs - I understand why.

What is the correct code to block by IP? If the IP isn't in the referrer logs either, will that not work? I used a hostname-to-IP lookup website, and this is the IP it finds: 64.72.112.53

Hmm.

jdMorgan

4:59 pm on Dec 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



> What is the correct code to block by IP? If the IP isn't in the referrer logs either, will that not work? I used a hostname-to-IP lookup website, and this is the IP it finds: 64.72.112.53

You'll need several changes to that code -- again, I recommend a review of the regular-expressions tutorial and the other material cited in our forum charter.


RewriteCond %{REMOTE_ADDR} ^64\.72\.112\.53$
RewriteRule .* - [F]

To exclude a broader range, in case they have more than one server, you could use:

RewriteCond %{REMOTE_ADDR} ^64\.72\.1(1[2-9]¦2[0-7])\.
RewriteRule .* - [F]

which will block 64.72.112.0 through 64.72.127.255.
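That range pattern can be verified with Python's re module, which uses the same regex syntax (written here with a solid pipe, as the note below requires):

```python
import re

# The third octet must be 112-119 (1 then 1[2-9]) or 120-127 (1 then 2[0-7]).
block = re.compile(r"^64\.72\.1(1[2-9]|2[0-7])\.")

assert block.match("64.72.112.53")     # inside the range
assert block.match("64.72.127.255")    # top of the range
assert not block.match("64.72.111.9")  # just below the range
assert not block.match("64.72.128.1")  # just above the range
print("range check passed")
```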

Hopefully, that will take care of the immediate problem.

Note that you must change the broken pipe "¦" character in the code above to a solid pipe before use; posting on this forum modifies the pipe character.

Jim

jdMorgan

5:05 pm on Dec 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One addendum:

If you use a custom 403 error document, you will need to exclude it from the rule. Otherwise, your server will generate a second 403-Forbidden response when it tries to serve the 403 error document. Example:


RewriteCond %{REQUEST_URI} !^/local_path_to_custom_403_error_document\.html$
RewriteCond %{REMOTE_ADDR} ^64\.72\.112\.53$
RewriteRule .* - [F]

The local URL-path to the error document will be the same as given in the ErrorDocument directive, e.g.

ErrorDocument 403 /local_path_to_custom_403_error_document.html

Note that this path *must not* start with a protocol and domain name such as "http://www.example.com" in either the ErrorDocument or the RewriteCond directive -- it must be a local path only.
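The exclusion logic can be sketched in Python (the error-document path here is a hypothetical example, not a path from this thread):

```python
import re

# Hypothetical local URL-path for the custom 403 page.
ERROR_DOC = "/errors/403.html"

def should_block(remote_addr, request_uri):
    # Exclusion first: never block the error document itself,
    # or serving it would trigger a second 403.
    if request_uri == ERROR_DOC:
        return False
    return re.match(r"^64\.72\.112\.53$", remote_addr) is not None

print(should_block("64.72.112.53", "/advsearch.jsp"))  # True
print(should_block("64.72.112.53", ERROR_DOC))         # False
```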

Jim

mkhines

5:22 pm on Dec 20, 2006 (gmt 0)

10+ Year Member



Thanks again Jim,

I tried blocking by IP with your suggestion (the range of addresses), and it doesn't stop them.

I imagine this could be because of what they leave in the referrer logs:

- -> /advsearch.jsp
- -> /advsearch.jsp
- -> /advsearch.jsp

with nothing where there is usually information about where they come from.

I'm at a loss as to what to try, aside from holding myself back from contacting them directly and begging, pleading, and screaming.

Megan

jdMorgan

5:27 pm on Dec 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



May we see a fresh log entry again, please?

The code won't stop their requests from being logged. But you should see them receive a 403-Forbidden response code instead of a 200-OK, a 304-Not Modified, 404-Not Found, or 410-Gone response, etc.

Jim

mkhines

5:51 pm on Dec 20, 2006 (gmt 0)

10+ Year Member



Hot off the press.

This is the access log:

101057-web2.gold.funnelback.com - - [20/Dec/2006:11:47:30 -0600] "GET /advsearch.jsp?search=captive%20deer%20Chronic%20wasting%20disease%20Prion%20protein%20Rocky%20Mountain%20elk
&filter=&sortBy=1&sortDir=1&pagemode=advsearch HTTP/1.1" 200 74833
101057-web2.gold.funnelback.com - - [20/Dec/2006:11:47:43 -0600] "GET /advsearch.jsp?search=captive%20deer%20Chronic%20wasting%20disease%20Prion%20protein%20Bison%20buffalo
&filter=&sortBy=1&sortDir=1&pagemode=advsearch HTTP/1.1" 200 50712
101057-web2.gold.funnelback.com - - [20/Dec/2006:11:47:56 -0600] "GET /advsearch.jsp?search=captive%20deer%20Chronic%20wasting%20disease%20Prion%20protein%20Parasitic%20diseases
&filter=&sortBy=1&sortDir=1&pagemode=advsearch HTTP/1.1" 200 53477
101057-web2.gold.funnelback.com - - [20/Dec/2006:11:48:15 -0600] "GET /advsearch.jsp?search=captive%20deer%20Chronic%20wasting%20disease%20Prion%20protein%20Parasitology
&filter=&sortBy=1&sortDir=1&pagemode=advsearch HTTP/1.1" 200 54083
101057-web2.gold.funnelback.com - - [20/Dec/2006:11:48:30 -0600] "GET /advsearch.jsp?search=captive%20deer%20Chronic%20wasting%20disease%20Prion%20protein%20Viral%20diseases
&filter=&sortBy=1&sortDir=1&pagemode=advsearch HTTP/1.1" 200 54577
101057-web2.gold.funnelback.com - - [20/Dec/2006:11:48:45 -0600] "GET /advsearch.jsp?search=captive%20deer%20Chronic%20wasting%20disease%20Colorado%20Brain
&filter=&sortBy=1&sortDir=1&pagemode=advsearch HTTP/1.1" 200 62517

All showing 200 (success) responses.

Here is the rewrite section in httpd.conf:

RewriteCond %{REMOTE_ADDR} ^64\.72\.1(1[2-9]¦2[0-7])\.
RewriteRule .* - [F]

And here is a snippet from the referrer log:

- -> /advsearch.jsp
- -> /advsearch.jsp
- -> /advsearch.jsp
- -> /advsearch.jsp

We are logging "normal" referrers as well, but nothing that points to this funnelback.

[edited by: jdMorgan at 6:50 pm (utc) on Dec. 20, 2006]
[edit reason] Stop side-scroll [/edit]

jdMorgan

6:06 pm on Dec 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Did you read and heed this?

Note that you must change the broken pipe "¦" character in the code above to a solid pipe before use; Posting on this forum modifies the pipe character.

Jim

mkhines

6:28 pm on Dec 20, 2006 (gmt 0)

10+ Year Member



Yep, I had changed the pipe.

I just tried this after talking to a colleague and it didn't work either.

RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{HTTP_USER_AGENT} ^$
RewriteCond %{REQUEST_URI} ^/$
RewriteRule .* - [F]

Any blank referrer, or user agent, requesting anything under the root should be blocked, correct?

And I tried this to specify the page itself; no dice.

RewriteCond %{HTTP_REFERER} ^$
RewriteCond %{REQUEST_URI} ^/advsearch.*$
RewriteRule .* - [F]

mkhines

6:52 pm on Dec 20, 2006 (gmt 0)

10+ Year Member



Jim,

You have been correct all along; we had the code in the wrong place in the httpd.conf file.

Thanks for all your help!

Megan

jdMorgan

6:53 pm on Dec 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Well, either you're missing the RewriteEngine on directive, or the "scope" of application is incorrect -- that is, the code is in a container such as <Directory>, <Location>, or even <VirtualHost> that is not right.

I'm peering through a tiny window here, and can't see much.

Jim

mkhines

7:00 pm on Dec 20, 2006 (gmt 0)

10+ Year Member



Yes, it should have been within the virtual host section, not in the main body of the file.

:)
Thanks.

jdMorgan

7:08 pm on Dec 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Cross-posted, but we were getting to that... :)

Note the disconnect between the description of the code you tried and the code itself:

> Any blank referrer, or user agent, requesting anything under the root should be blocked, correct?

This isn't a good idea, since it would block a lot of innocent and legitimate users coming to your site from behind ISP and corporate proxies, but in order to work at all, it needs to be written as described above:


# Any blank referrer
RewriteCond %{HTTP_REFERER} ^$ [OR]
# OR any blank user-agent
RewriteCond %{HTTP_USER_AGENT} ^$
# requesting any URL-path
RewriteRule .* - [F]

Without the [OR] flag, the default operator is "AND," so both RewriteConds would have had to have been true for the rule to be applied.

The last RewriteCond was redundant, since it was already implicit in the RewriteRule pattern, and also, that RewriteCond's pattern was fully-anchored, so the rule would have only affected requests for "example.com/" and only "example.com/" -- nothing below that. So I removed it.
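The AND-versus-OR behavior of chained RewriteConds can be modeled in a few lines of Python (the header values are hypothetical):

```python
# Model mod_rewrite condition chaining with hypothetical header values.

def blocked_with_or(referer, user_agent):
    # With the [OR] flag: either blank header triggers the rule.
    return referer == "" or user_agent == ""

def blocked_default_and(referer, user_agent):
    # Default (no flag) is AND: both headers must be blank.
    return referer == "" and user_agent == ""

print(blocked_with_or("", "SomeBot/1.0"))      # True  - blank referrer alone triggers
print(blocked_default_and("", "SomeBot/1.0"))  # False - the user-agent is present
```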

Glad you got it working!

If you have access to the firewall, let your code run for a few days while observing this abuser, and then once you're sure you've got it covered, you can move the blocking function from mod_rewrite in httpd.conf to the ACL in your firewall. It will keep those requests from even connecting to your server -- and save you some space in your log files. Firewall stuff is beyond the scope of this forum, but it's something to consider/research.

Jim