I am in Bill's camp when it comes to this. It's really not *hard* to white list properly; it just takes a little homework and a decent-sized set of log file history. If you're sophisticated enough to make your own rewrite rules to catch bad user agents, you've got 95% of the skill set needed to look at other headers and write rules based on those.
For me, in the long run white listing results in less effort, not more. Blacklists, unfortunately, must be constantly updated, and even then they are terribly ineffective. Long gone are the days when they would catch most bots... in recent years I have found they catch only a minority of them.
Here's what it takes to white list without giving the bad guys the info we don't want them to have:
- Initially, log all headers for a month or so, not just user agent, IP, requested page, referrer, etc. (see the logging sketch after this list).
- Hit the site yourself with every browser, and every version of that browser, you can lay your hands on. Not just on your PC: test mobile browsers as well, via services like Browser Cam. On my Android phone I must have installed at least a dozen browsers (luckily 95% of them use the same "engine", so the headers aren't unique!). Log the headers so you can see what all of these send. Make sure you do this through a non-proxy connection.
- Using the header information from above, white list based on the User-Agent, Accept-Encoding, Connection, and Accept headers when there are no proxy headers present (see the rewrite sketch after this list). I won't list the proxy headers here, since that information is readily available elsewhere, but there are roughly a dozen which cover 99% of the proxies humans use. If you want to accept only users with specific languages, also filter on Accept-Language. You've now white listed 95% of valid users.
- Build a list of IP ranges for the search engines you allow, along with the user agents you want to allow from those ranges, and white list them. Block any search engine user agent not coming from its valid IP ranges (example after this list). These IP ranges are also available on WebmasterWorld.
- Next, filter your logs with grep, awk, or similar tools on your platform for requests which do not match the preliminary white list above (see the log-filter sketch after this list). You want a large sample, so depending on your traffic this may be a week's, a month's, or several months' worth of logs. Whenever you get something which doesn't pass the filters, look at the headers closely, especially the various proxy headers. Determine manually, using IP lookup tools, analysis of the time between page fetches, etc., whether it is a human or not. If it is, add the unique header combination (usually user agent, Accept-Encoding, Connection, and one of the proxy headers) to the white list.
- Now you're at 99.9% or better! Sounds good, but 1 out of 1,000 visitors blocked is too many for me. I like to get it above 99.99%, so next...
- Anything still blocked, rewrite to a human verification page (sketch after this list). Have the results of the human verification page emailed to you along with all headers, and use this information to add a few more rules to your white list.
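Since the above is fairly abstract, here are some rough sketches of what these steps can look like in practice. Every header value, path, and IP range in them is a placeholder to be replaced with what your own logs tell you. First, the logging; something along these lines with mod_log_config captures the headers that matter (mod_log_forensic's ForensicLog can capture literally every header, if you want them all):

```apache
# Log the headers that matter for white listing alongside the usual
# fields. %{Name}i logs a request header; extend the list to taste.
LogFormat "%h %t \"%r\" %>s \"%{User-Agent}i\" \"%{Accept}i\" \"%{Accept-Encoding}i\" \"%{Accept-Language}i\" \"%{Connection}i\" \"%{Via}i\" \"%{X-Forwarded-For}i\"" headers
CustomLog logs/headers.log headers
```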
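For the header white list itself, a stripped-down mod_rewrite sketch. The browser tokens and header values below are examples only; build the real list from the logs above:

```apache
RewriteEngine On

# Pass requests whose core headers look like a known browser combination.
# These values are examples only; derive the real ones from your own logs.
RewriteCond %{HTTP_USER_AGENT} (Firefox|Chrome|Safari|MSIE|Opera)
RewriteCond %{HTTP:Accept-Encoding} gzip
RewriteCond %{HTTP:Accept} text/html
RewriteRule .* - [S=1]

# Anything that fell through gets flagged; a later rule decides what to do.
RewriteRule .* - [E=SUSPECT:1]
```

The [S=1] skips the flagging rule when the headers look like a real browser; everything else carries a SUSPECT flag for the rules further down.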
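For search engines, the same idea keyed to IP ranges. The Googlebot range below is purely for illustration; verify current ranges yourself, and repeat per engine:

```apache
# Accept Googlebot's user agent only from an IP range Google is known
# to crawl from (66.249.64.0/19 here; verify before relying on it).
RewriteCond %{HTTP_USER_AGENT} Googlebot
RewriteCond %{REMOTE_ADDR} !^66\.249\.(6[4-9]|[78][0-9]|9[0-5])\.
RewriteRule .* - [F]
```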
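For combing the logs, something as simple as this gets you a ranked list of requests that didn't pass the preliminary white list (the field positions depend on your own LogFormat):

```sh
#!/bin/sh
# Quick-and-dirty filter for requests that don't match the preliminary
# white list yet. Patterns and the log path are examples.
grep -vE 'Firefox|Chrome|Safari|MSIE|Opera|Googlebot|bingbot' \
    /var/log/apache2/headers.log |
  awk -F'"' '{print $2 " | " $4}' |  # request line and User-Agent
  sort | uniq -c | sort -rn | head -50
```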
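And for the human verification step, anything still carrying the SUSPECT flag from the sketch above gets rewritten to a verification page. /verify.php is a made-up name for this sketch; that page would do the emailing:

```apache
# Send anything still flagged as suspect to the verification page
# instead of the content it asked for.
RewriteCond %{ENV:SUSPECT} =1
RewriteCond %{REQUEST_URI} !^/verify\.php$
RewriteRule .* /verify.php [L]
```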
I found that in the first couple of weeks of doing the above I ended up adding a handful of additional rules to my white list. After that I started saving time, expending far less effort chasing down bots.
- For the first couple of months after that, I went back through the logs at the end of each month looking for additional white list items. Now, a year later, I spend very little time chasing bots.
- All this has resulted in a much smaller set of rules, and since page load time can be a ranking factor these days, it also slightly speeds up time to first byte.
Additional things which will result in less pulling out of your hair in the long run:
- Depending on your user base, you may want to block certain countries. I choose to do this at the firewall level instead of with rewrite or deny rules (firewall sketch after this list). It is far more CPU efficient (better time to first byte), and it also blocks a ton of SSH probes, SMTP attacks, etc.
- Use mod_geoip to block anonymous proxies (example after this list). mod_geoip caches IP lookups, so the overhead is low.
- Set up an hourly script to update a list of the Tor exit nodes which can connect to your server's IP(s), and block them (sketch after this list).
- Set up a free account at Project Honey Pot and install their Apache module (lookup sketch after this list). The module has a few settings you can use to block recent comment spammers, harvesters, spam servers, dictionary attackers, and "rule breakers". To avoid overhead and added time to first byte, I have it set up to do a lookup only on pages which require human input, such as registration and contact pages.
- If you have a vBulletin forum, install the Spambot Stopper plugin. I have found it to be more effective than all other spam/crawler related plugins combined. It has helped me catch, and block at the firewall, numerous IP ranges of server farms running very sophisticated bots which emulate real user headers effectively enough to get through the white list.
- If you use a Content Delivery Network, do not use it for your actual content (i.e., HTML pages and content images). Use it for CSS, JavaScript, and navigation images, but not other images. Block any CDN requests which do not fetch these (sketch after this list). The reason: I found that many crawlers will attempt to get at your content via the CDN. If you are in a situation where you must serve actual content via the CDN, make sure the CDN company has a method or a plug-in/module for your web server to receive the end user's real header information so you can white list properly.
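For the country blocking, a sketch with ipset and iptables. The country code and the zone-file source are placeholders; any maintained per-country CIDR list will do:

```sh
#!/bin/sh
# Drop one country's traffic at the firewall with ipset + iptables.
CC=cn
ipset create "block-$CC" hash:net -exist
wget -qO- "http://www.ipdeny.com/ipblocks/data/countries/$CC.zone" |
  while read -r net; do
      ipset add "block-$CC" "$net" -exist
  done
# Add the DROP rule once (not on every refresh):
iptables -C INPUT -m set --match-set "block-$CC" src -j DROP 2>/dev/null ||
  iptables -I INPUT -m set --match-set "block-$CC" src -j DROP
```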
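For mod_geoip, the legacy MaxMind country database flags anonymous proxies with the pseudo country code A1, so blocking them can look roughly like this (paths are placeholders, and the access directives are shown in Apache 2.2 style):

```apache
# Block anonymous proxies via mod_geoip; the MemoryCache flag keeps
# lookups cheap by holding the database in memory.
GeoIPEnable On
GeoIPDBFile /usr/share/GeoIP/GeoIP.dat MemoryCache

SetEnvIf GEOIP_COUNTRY_CODE A1 AnonProxy
<Directory "/var/www/html">
    Order Allow,Deny
    Allow from all
    Deny from env=AnonProxy
</Directory>
```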
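The Tor refresh can be a small cron script along these lines. The exit-list URL has moved around over the years, so treat it as a placeholder and check the Tor project's current location for it:

```sh
#!/bin/sh
# Hourly refresh of an ipset holding Tor exit nodes able to reach this
# server. SERVER_IP is a placeholder for your own address.
# Cron entry, e.g.: 0 * * * * /usr/local/sbin/refresh-tor-exits.sh
SERVER_IP=203.0.113.10
ipset create tor-exits hash:ip -exist
ipset flush tor-exits
wget -qO- "https://check.torproject.org/cgi-bin/TorBulkExitList.py?ip=$SERVER_IP" |
  grep -v '^#' |
  while read -r ip; do
      ipset add tor-exits "$ip" -exist
  done
# Add the DROP rule once (not every run):
iptables -C INPUT -m set --match-set tor-exits src -j DROP 2>/dev/null ||
  iptables -I INPUT -m set --match-set tor-exits src -j DROP
```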
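I won't reproduce the honey pot module's own settings here, but for the curious, the lookup it performs boils down to a single DNS query against http:BL, roughly like this (KEY stands in for your own http:BL access key; the IP is an example visitor address):

```sh
#!/bin/sh
# One http:BL lookup: query <key>.<reversed-ip>.dnsbl.httpbl.org.
KEY=mykey
IP=192.0.2.55
REV=$(echo "$IP" | awk -F. '{print $4"."$3"."$2"."$1}')
# A response of 127.<days>.<threat>.<type> means the IP is listed;
# type is a bitmask: 1=suspicious, 2=harvester, 4=comment spammer.
host -t A "$KEY.$REV.dnsbl.httpbl.org"
```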
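And for the CDN rule, assuming your CDN stamps its fetches with an identifying header (X-From-CDN below is hypothetical, as is the /images/nav/ path; check what your CDN really sends), the origin can refuse to serve it anything but static assets:

```apache
# Refuse CDN-originated requests for anything that isn't CSS,
# JavaScript, or a navigation image.
RewriteEngine On
RewriteCond %{HTTP:X-From-CDN} !^$
RewriteCond %{REQUEST_URI} !\.(css|js)$ [NC]
RewriteCond %{REQUEST_URI} !^/images/nav/ [NC]
RewriteRule .* - [F]
```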
As a side note...
Lucy24, my white list lets stuff like your browser through... I ran into that early on and addressed it. :)