Forum Moderators: phranque


Yet another Rewrite-rule request for help

Thought I had mod_rewrite all sussed out... how wrong I was


AlexK

3:04 pm on Apr 27, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The following is in httpd.conf, immediately *above* all virtual-hosts. The idea, of course, is to have some re-write rules that will apply to all accesses.

I thought that these were working but, sure enough, someone has been able to (try to) scrape the site using the DKIMRepBot. The scraper got stopped by my site PHP stop-bad-bots routine. The following is supposed to stop such folks from even reaching those routines:

#
# server-wide rewrite directives
# 2007-04-13 added -AK
#
<IfModule mod_rewrite.c>
RewriteEngine On
RewriteLog logs/rewrite_log
RewriteLogLevel 0
#
# 2009-04-18 added LocalBot (public web scraper)
# 2008-03-26 added MJ12bot (ubiquitous public web scraper)
# reject requests using `PuxaRapido' (badly written file scraper)
# 2007-04-13 added -AK
#
RewriteCond %{HTTP_USER_AGENT} .
RewriteCond %{HTTP_USER_AGENT} DKIMRepBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} LocalBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MJ12bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PuxaRapido [NC]
RewriteRule ^.* - [F,L]
</IfModule>
# End of server-wide rewrite directives

Something is wrong with the routine as written... but I'm blowed if I can see it.

This is what the bot UA looks like:

Mozilla/5.0 (compatible; DKIMRepBot/1.0; +http://www.bot.tld)

Can anyone see where the code goes wrong?

jdMorgan

3:27 pm on Apr 27, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Do you have FollowSymLinks or SymLinksIfOwnerMatch enabled in the context of this code, as required to enable mod_rewrite?

Note that [L] used with [F] is redundant, because [F] implies [L].

See the Apache mod_rewrite documentation for more info on these two points.

Also note that MJ12bot is not a scraper by definition; it is a distributed robot which can be deployed by individuals or organizations for good or evil. It's fairer to determine the good/evil status of any particular instance of this 'bot by checking whether it reads and complies with robots.txt. (I'm not disputing your decision to block it here, just your blanket classification of this 'bot as a "scraper.")

The MJ12 robot's author occasionally participates in the Search Engine Spider Identification forum here at WebmasterWorld.

One more thing: Be sure that when you get it working, this code does not interfere with the serving of custom 403 error pages. You don't want to return a 403-Forbidden response when a custom 403 page is served, because the result will be yet another 403! This creates an infinite loop, and can amount to a "self-inflicted denial-of-service attack." This may not be a problem when the code resides in a server config file, but do test it; it is a very common problem when such code is deployed in .htaccess files. The solution, if needed, is to exclude custom 403 error pages from being forbidden. That's easy if you adopt and/or enforce a common URL-path for custom 403 pages, but may require checking %{REDIRECT_STATUS} or a similar variable if you cannot know the name of the custom 403 page in advance.
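[Editor's note: a minimal sketch of the exclusion described above. The error-page path /errors/403.html is a hypothetical placeholder, not a path from this thread; the extra RewriteCond simply exempts that one URL-path from the block so the custom page can be served without looping.]

```apache
# Point 403 responses at a custom page (hypothetical path):
ErrorDocument 403 /errors/403.html

# Exempt the custom 403 page itself from the block, so serving it
# does not trigger another 403 and loop:
RewriteCond %{REQUEST_URI} !^/errors/403\.html$
RewriteCond %{HTTP_USER_AGENT} DKIMRepBot [NC]
RewriteRule ^ - [F]
```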

Jim

AlexK

4:57 am on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hello Jim - thanks for a quick response.

Well, I've gone through the documentation [httpd.apache.org] for the umpteenth time. I seem to read mod_rewrite docs more often than my pension plan.

I did not realise that there were any external requirements, so thanks for that:

`FollowSymLinks [httpd.apache.org]' does not apply to my situation - .htaccess files need it, httpd.conf does not (in any case, it *is* enabled within context of these rules).

As best as I can tell, a small section on Virtual Hosts [httpd.apache.org] explains this problem:

By default, mod_rewrite configuration settings from the main server context are not inherited by virtual hosts. To make the main server settings apply to virtual hosts, you must place the following directives in each <VirtualHost> section:
RewriteEngine On
RewriteOptions Inherit

Then the info on RewriteOptions [httpd.apache.org]:
inherit
This forces the current configuration to inherit the configuration of the parent. In per-virtual-server context, this means that the maps, conditions and rules of the main server are inherited. In per-directory context this means that conditions and rules of the parent directory's .htaccess configuration are inherited.
(my emphasis)

The above is not crystal clear for me, but it seems to say that main-config mod_rewrite sections will be ignored unless a virtual-server includes a `RewriteOptions Inherit' directive. At least, that would make sense of my reality.
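[Editor's note: a minimal sketch of what that reading implies each virtual host would need; the hostname is a placeholder. Without these two directives, rewrite rules defined in the main server context are silently ignored inside the vhost.]

```apache
<VirtualHost *:80>
    ServerName example.com
    # Opt in to the server-wide rewrite rules defined above the vhosts:
    RewriteEngine On
    RewriteOptions Inherit
</VirtualHost>
```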

Jim:

Also note that MJ12bot is not a scraper by definition

Please permit me to disagree. By my definition, `Google' is a scraper, as are all the others; the difference is that `Google' is a well-behaved scraper that pays back handsomely for the bandwidth & server load that it consumes. By contrast, `DKIMRepBot' was used on my site at 15 hits/sec, and `MJ12bot' at up to 10 hits/sec across many days from multiple IPs. The httpd.conf comments are not meant to take sides, although I would suggest to the bot-author that, as the responsible adult in the room, he moderate some aspects of his bot for the sake of other webmasters, remembering what children are like when you give them toys.

jdMorgan

6:38 pm on Apr 29, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The above is not crystal clear for me, but it seems to say that main-config mod_rewrite sections will be ignored unless a virtual-server includes a `RewriteOptions Inherit' directive. At least, that would make sense of my reality.

Correct.

Subjective opinions vary, but MJ12 is still not a scraper by definition. It is a distributed robot. A baseball or cricket bat is not a weapon by definition, despite the fact that it might be used as one.

As with most other 'bots, MJ12's user-agent has been spoofed by scrapers and other malicious user-agents, and its author has posted here on this subject.

As this is a technical forum, let's keep the definitions technical, rather than subjective or emotional. I believe that I sufficiently qualified my original statement to make my meaning clear.

Jim

AlexK

5:47 am on Apr 30, 2009 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Ah, no argument from me with either you or MJ12's author, Jim, and (not much of a) personal axe to grind.

As with most other 'bots, MJ12's user-agent has been spoofed by scrapers

I did not think of that - thanks for pointing it out (that must be a hard thing for the author to have to live with).

Most permanent blocks on my site are enacted within the firewall (and therefore by IP). The very few within httpd.conf follow on from demonstrated abuse on my site from multiple IPs where a firewall block would be either unwise or impractical. The comments within the section are for my benefit, so that at a later time I can know what & when to reverse.

Thank you again for your response, Jim. I have found your careful remarks (of all kinds) in this thread--and elsewhere--useful in giving me a different perspective, which always helps in teasing out the truth.