Currently I have a server with multiple websites/virtual hosts (Apache 2.2.3).
My objective is to block bots/scrapers from all the websites by the easiest method.
At the moment I prepend a .php file to all pages hosted on the server, e.g.
php_value auto_prepend_file "/var/www/html/myprependfilehere.php"
In this file there are routines to block some bots/scrapers by IP address, user agent, country (via a database lookup) and so on - it's not perfect, but it keeps a lot away.
...
I'm thinking it would be more efficient to do some of this, notably the blocking by IP address and user agent, directly in Apache using httpd.conf.
My requirements are:
1. I want to write the rules once and have them apply to all the virtual hosts.
2. I want to log all the blocks to a separate log file (rough idea sketched below).
3. I want to allow access to the robots.txt file on every domain on the server.
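For the separate log file, my rough (untested) idea is to tag blocked requests with an environment variable via mod_rewrite's E= flag and then write a conditional CustomLog - the variable name and log path here are just placeholders:

# inside the shared <Directory> block: forbid the request and tag it
RewriteRule ^ - [F,E=blocked_bot:1]

# at server level (CustomLog isn't allowed inside <Directory>):
# only requests carrying the blocked_bot variable land in this log
CustomLog /var/log/httpd/blocked.log combined env=blocked_bot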
Of course I've seen a great many examples, here and elsewhere, of how to block by IP address and user agent in a specific .htaccess file or specific directory. Do I have to do anything different to apply this server-wide?
This is where I've got to so far:
<Directory "/var/www/html">
RewriteEngine On
# block by user agent (patterns containing a space must be quoted;
# flags are comma-separated)
RewriteCond %{HTTP_USER_AGENT} "^Indy Library" [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(80legs\.com|Nutch|Wget/) [NC,OR]
# block by IP address (the trailing \. catches the whole 198.65.x.x range)
RewriteCond %{REMOTE_ADDR} ^198\.65\. [OR]
RewriteCond %{REMOTE_ADDR} ^123\.456\.123\.45$
RewriteRule ^ - [F]
</Directory>
<VirtualHost *:80>
DocumentRoot "/var/www/html/example.com"
ServerName example.com
<Directory "/var/www/html/example.com">
various stuff here
</Directory>
</VirtualHost>
<VirtualHost *:80>
DocumentRoot "/var/www/html/example2.com"
ServerName example2.com
<Directory "/var/www/html/example2.com">
various stuff here
</Directory>
</VirtualHost>
<VirtualHost *:80>
DocumentRoot "/var/www/html/example3.com"
ServerName example3.com
<Directory "/var/www/html/example3.com">
various stuff here
</Directory>
</VirtualHost>
All the virtual hosts have a document root in the format /var/www/html/domain-name-here.
To sum up, I'm wondering:
1. How do I punch a hole and allow access to all the robots.txt files? (my rough guess is sketched below)
2. If I put the rules in that shared <Directory> block, will they then apply to all the virtual hosts?
3. Is this the best way to go about it? I'm assuming it will be quicker and use fewer resources than the prepended PHP file.
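For the robots.txt hole, my rough (untested) guess is a negated condition ahead of the blocking conditions - as I understand it, a condition without [OR] is ANDed with the ORed group that follows, so any request for /robots.txt would skip the block:

# let every request for robots.txt through before any blocking applies
RewriteCond %{REQUEST_URI} !^/robots\.txt$
# ...the existing user-agent / IP conditions, ORed together, go here...
RewriteRule ^ - [F]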