Forum Moderators: phranque
While my statistics are not currently public (I am redeveloping awfreak, a more in-depth take on awstats), I like to set a good example for others.
How could I create a blacklist based on words in the referrer? I don't want to ban a specific domain, but rather any domain that contains one of those words.
For example, let's say I was offended by the words...
cat
cheese
I'd like these words in the filter to then naturally block referrers such as...
www.eatmycat.com
www.bigroundpillows.net
Also, I'd like to make sure nothing is served to any referrer until it passes these filters. It wouldn't make much sense if they got past the filters and ended up in the access logs, where they would be parsed by the statistics scripts and I'd have to remove them manually anyway.
The requests with those referrers aren't from normal surfers. They come from a special program that requests a page from your site but ignores the actual data. Blocking them may minimally reduce bandwidth costs, but the offending referrers will still appear in your logs (just with a 403 instead of a 200 result code).
There's only one reasonable solution: Ignore the spammers and don't publish your logs.
I would like to protect innocent users just in case. How would I specify a 403 error page? I tried variations of the code you suggested in other posts (modified as I thought might work), but to no avail.
Here is the code I was playing with...
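RewriteCond $1 !^favicon\.ico$
RewriteCond %{REQUEST_METHOD} !^HEAD$
RewriteCond %{HTTP_REFERER}<->%{HTTP_USER_AGENT} ^<->$
RewriteRule \.(html?|php|txt|css|js)$ error/error-403-ua.php [L]
ErrorDocument 403 /path_to_custom_403_page.html
#
# Let the custom 403 page itself through, otherwise the block below would refuse it too
RewriteCond %{REQUEST_URI} !^/path_to_custom_403_page\.html$
# Refuse any request whose referrer contains one of the words (case-insensitive)
RewriteCond %{HTTP_REFERER} (cat|cheese) [NC]
RewriteRule \.(html?|php|css|js)$ - [F]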
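One caveat with the blacklist condition above: it tests the whole referrer string, so a word that appears in the path or query string (for example a search-engine query for "cat") would also be refused. A minimal sketch of a variant that only looks at the hostname, assuming the referrer begins with http:// or https:// and reusing the same example words and file extensions:

```apache
# Sketch only: match the blacklisted words in the hostname part of the referrer.
# After the scheme, [^/]* cannot cross a slash, so paths and query strings are ignored.
RewriteCond %{HTTP_REFERER} ^https?://[^/]*(cat|cheese) [NC]
RewriteRule \.(html?|php|css|js)$ - [F]
```

With this version a referrer of http://www.eatmycat.com/ is refused, while a referrer such as http://www.example.com/search?q=cat (a made-up address) is let through. It is still a plain substring match, so an innocent hostname that happens to contain "cat" would be refused as well.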
All this 403 will do is to get a few of them to take you off their list, but not very many. Their goal is to create a log entry, and their automated user-agents are very dumb because they don't have to be smart. They simply do a HEAD or a GET from every domain they can find, and play the numbers; most sites don't have public logs or stats, but it's still worth it for them because a few do.
The only way to keep them out of your log file is to block them by IP address at the server's firewall. But of course, many of them are using open proxies, so the block lists tend to be huge and must be maintained.
Jim
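For anyone who can edit the main server configuration rather than only .htaccess, Apache's conditional logging may be worth a look as a complement to firewall blocks. The following is a hedged sketch, not a tested setup: the variable name spam_ref and the log path are made up for illustration, and it assumes a "combined" log format is already defined with LogFormat, as it is in most default configurations.

```apache
# Sketch only: flag requests whose referrer contains a blacklisted word...
SetEnvIfNoCase Referer "(cat|cheese)" spam_ref
# ...and write a log entry only when that flag is NOT set.
# CustomLog is valid in server config or <VirtualHost> context, not in .htaccess.
CustomLog logs/access_log combined env=!spam_ref
```

This does not stop the requests from reaching the server. It only keeps the flagged entries out of that one log file, so the statistics scripts never see them.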