Using mod_rewrite at a server level


topr8

5:22 pm on Jun 22, 2012 (gmt 0)

Currently I have a server with multiple websites/virtual hosts (Apache 2.2.3).

My objective is to block bots/scrapers from all the websites by the easiest method.

At this moment I prepend a .php file to all pages hosted on the server,
e.g. php_value auto_prepend_file "/var/www/html/myprependfilehere.php"

In this file there are routines to block some bots/scrapers by IP address, user agent, country (by looking up in a database) and so on - it's not perfect, but it keeps a lot away.

...

I'm thinking it would be more efficient to do some of this, notably the blocking by IP address and user agent, directly in Apache using httpd.conf.
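As an aside, for plain IP-address and user-agent blocks there is also a way to do this without mod_rewrite at all (a sketch; the patterns below are placeholders for illustration, not a recommended blocklist): mod_setenvif together with the 2.2-style Order/Deny directives:

# Tag unwanted clients (placeholder patterns, illustration only)
SetEnvIfNoCase User-Agent "Wget|Nutch" bad_bot
SetEnvIf Remote_Addr "^198\.65\." bad_bot

<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    # Refuse anything tagged above
    Deny from env=bad_bot
</Directory>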

My requirements are:

1. I want to write the rules once and have them apply to all the virtual hosts.
2. I want to log all the blocks in a separate log file (see the sketch after this list).
3. I want to allow access to the robots.txt file on every domain on the server.
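For point 2, one possibility (a sketch; the E= flag usage and the log path are my assumptions, not anything tested on your box) is to have the blocking rule set an environment variable and then write a conditional CustomLog for it:

# In the blocking ruleset: forbid the request and tag it in one go
RewriteRule ^ - [F,E=blocked:1]

# In the main server config: log only tagged requests
# (after the internal redirect to the 403 page the variable can be
# renamed REDIRECT_blocked, so add a second CustomLog for that name
# if entries go missing)
CustomLog /var/log/httpd/blocked.log combined env=blocked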

Of course I've seen a great many examples here and elsewhere of how to block by IP address and user agent in a specific .htaccess file or a specific directory. I'm wondering whether I have to do anything differently at the server level.


This is where I've got to so far:

<Directory "/var/www/html">
    # Per-directory rewrites also need Options FollowSymLinks enabled here
    RewriteEngine On
    # The pattern is a single argument, so a literal space has to be
    # escaped or the pattern quoted; flags are separated by commas
    RewriteCond %{HTTP_USER_AGENT} "^Indy Library" [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(80legs\.com|Nutch|Wget/) [NC,OR]
    # Prefix match on the first two octets; a trailing $ here would
    # never match a full address
    RewriteCond %{REMOTE_ADDR} ^198\.65\. [OR]
    RewriteCond %{REMOTE_ADDR} ^123\.456\.123\.45$
    RewriteRule ^ - [F]
</Directory>


<VirtualHost *:80>
DocumentRoot "/var/www/html/example.com"
ServerName example.com
<Directory "/var/www/html/example.com">
various stuff here
</Directory>
</VirtualHost>

<VirtualHost *:80>
DocumentRoot "/var/www/html/example2.com"
ServerName example2.com
<Directory "/var/www/html/example2.com">
various stuff here
</Directory>
</VirtualHost>

<VirtualHost *:80>
DocumentRoot "/var/www/html/example3.com"
ServerName example3.com
<Directory "/var/www/html/example3.com">
various stuff here
</Directory>
</VirtualHost>

All the virtual hosts have a document root in the format
/var/www/html/<domain name here>

I'm wondering how I punch a hole and allow access to all the robots.txt files (see the sketch below),
and whether, if I put the rules in that <Directory> section, they will then apply to all the virtual hosts.

And also whether this is the best way to go about it - I'm assuming it will be quicker and use fewer resources than the prepended PHP file would.
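For the robots.txt hole, one possibility (a sketch built on the ruleset above; nothing here is confirmed against your setup) is an extra negated RewriteCond. %{REQUEST_URI} holds the URL path, so the test is independent of which docroot the request maps into:

<Directory "/var/www/html">
    RewriteEngine On
    # Exempt robots.txt first; this negated condition is ANDed with
    # the ORed group of blocking conditions below it
    RewriteCond %{REQUEST_URI} !^/robots\.txt$
    RewriteCond %{HTTP_USER_AGENT} "^Indy Library" [NC,OR]
    RewriteCond %{HTTP_USER_AGENT} ^(80legs\.com|Nutch|Wget/) [NC,OR]
    RewriteCond %{REMOTE_ADDR} ^198\.65\. [OR]
    RewriteCond %{REMOTE_ADDR} ^123\.456\.123\.45$
    RewriteRule ^ - [F]
</Directory>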

incrediBILL

4:37 pm on Jun 23, 2012 (gmt 0)

My objective is to block bots/scrapers from all the websites by the easiest method.


Then get a bot-blocking script: those are built by people who do this all the time and keep them updated, so you're less likely to nuke something you need.

Also, if you continue using Apache to do this, don't blacklist user agents; whitelist them instead (a sketch follows), otherwise you could clog the server processing thousands of conditions per page request.
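A minimal sketch of that whitelist idea, assuming the robots.txt exemption is kept and taking the agent list (ordinary Mozilla-identifying browsers plus a couple of named crawlers) as purely illustrative:

<Directory "/var/www/html">
    RewriteEngine On
    # Always serve robots.txt
    RewriteCond %{REQUEST_URI} !^/robots\.txt$
    # One negated condition replaces an ever-growing blacklist:
    # forbid anything that does not match the whitelist
    RewriteCond %{HTTP_USER_AGENT} !(Mozilla|Googlebot|bingbot) [NC]
    RewriteRule ^ - [F]
</Directory>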