
Creating a close to perfect .htaccess ban list

...the right way...

7:20 pm on Feb 24, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:July 27, 2001
votes: 0

Two years ago I embarked on a journey that would eventually lead me to the Holy Grail of .htaccess access-based security. Frustrated with what seemed to be a never-ending list of banned IPs and user-agents, I realized that I needed a better solution. Could I achieve the same or greater protection without banning a single IP? Would this even be possible?

I knew that attempting to do this would be a huge challenge. It would require hundreds of man hours devoted to browser sniffing, software coding, and research. In the end, I had no way of knowing if any of this hard work would even result in a practical solution. Nonetheless, I decided to go ahead and try.

The ground rules I laid out for myself were quite simple. First, I couldn't ban a single IP or IP range. Second, I couldn't ban any portion of a user-agent. I would also have to take a leap of faith and allow all bots free rein on my site until I could devise an agreeable way to ban each one of them. Finally, the completed Apache rules could not require more than 10k of code.

Now most of us can tell a bot from a web browser at first glance. Craftier bots sometimes require a little more investigation, but once you dig a little deeper into the header fields they send (or, just as importantly, the ones they don't send) you can generally weed out the good hits from the bad. Header fields and their order are very important too. Web browsers send different header fields for different types of file requests. Some web browsers send unique header fields or values that set them apart from other browsers. For example, Firefox often uses a Keep-Alive header value of 115, whereas IE doesn't send one. Mobile browsers also send unique headers. With patience and a lot of testing, you will learn to identify the type of browser being used without even having to look at the user-agent.
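To make this concrete, here is one way such a fingerprint could be expressed as a rule. This is an illustration only, not a rule from the author's file: the Keep-Alive value of 115 applies to certain Firefox versions, other legitimate browsers may omit the header entirely, and the negative lookahead requires Apache 2's PCRE regex engine. It merely flags the request as suspect for later review rather than banning it:

```apache
# Illustration only: mark (don't ban) requests whose Keep-Alive header
# is present but carries a value other than 115 (the value some Firefox
# builds send). Browsers that omit the header are not matched at all.
SetEnvIf Keep-Alive "^(?!115$)" suspect
```

A "suspect" variable like this can be written to a custom log and cross-checked against the rest of the header fingerprint before you commit to a ban rule.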

You'll also need to carefully examine each hit originating from web robots. These bots often send unique headers depending on the language they were coded in. A Python or Perl bot might add a TE value in a Connection header. Likewise, PHP-based bots often send a 300 value in a Keep-Alive header. Sometimes they send header values that do not conform to standards. For example, most of you are probably familiar with the orangeask referrer bot. What you probably didn't know is that it sends an invalid Accept header value that begins with q=. Notice the missing [0-1] preceding the dot? This error was so unique that I was able to connect this bot to two others, including the original search engine bot the spammy orangeask software is using. By banning this one malformed value, I blocked at least three separate bots right off the bat and didn't have to resort to banning any IPs, user-agents, or referrers.

For obvious reasons, I cannot share my .htaccess file with you. I can, however, share the methods I devised to create it and explain its basic structure. The most important thing you will need is the ability to log all header fields sent by a user-agent; the standard Apache access logs will not suffice. I also recommend a JavaScript sniffer script to log any available values and compare them to the header fields later. You should also familiarize yourself with the Hypertext Transfer Protocol -- HTTP/1.1 [w3.org]. And like I explained above, take the time to understand how differing brands of browsers commonly interact with files on the web server.
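One way to get that extra header visibility (the specific field selection below is just an example, not the author's setup) is a custom log format via mod_log_config; for a verbatim dump of every header a client sends, mod_log_forensic is the usual tool:

```apache
# Example: log the header fields most useful for fingerprinting.
# Requires mod_log_config (compiled into Apache by default).
LogFormat "%h \"%r\" \"%{User-Agent}i\" \"%{Accept}i\" \"%{Accept-Language}i\" \"%{Accept-Encoding}i\" \"%{Connection}i\" \"%{Keep-Alive}i\" \"%{Referer}i\"" headers
CustomLog logs/headers_log headers

# Alternatively, mod_log_forensic records every request header verbatim:
# ForensicLog logs/forensic_log
```

Note that LogFormat/CustomLog belong in the server or virtual host configuration, not in .htaccess itself.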

The basic structure of my .htaccess file begins with the following blocks of rules. These rules use the SetEnvIf directive to block the specified attributes/values by setting the ban variable.

# Allow User-Agents

This section uses a rather complex white list rule that only allows modern Mozilla or Opera user-agents.
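The author's actual pattern isn't shared, but a simplified sketch of such a whitelist might look like the following. The product tokens and version cutoffs here are illustrative assumptions, and the negative lookahead requires Apache 2's PCRE regex engine:

```apache
# Illustrative sketch: set the ban variable on any user-agent that does
# not begin with a modern Mozilla or Opera product token.
SetEnvIf User-Agent "^(?!Mozilla/[45]\.0 \(|Opera/9)" ban
```

A real rule would need to be considerably more precise to avoid banning legitimate but unusual browsers.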

# Allowed File Extensions

What file types does your site use? Allow only those requests. If necessary, ban any requests for specific files you don't want downloaded.
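As a sketch of how this section could work (the extension list is an example for a hypothetical site, and the lookahead again assumes Apache 2's PCRE engine):

```apache
# Illustrative only: ban any request that is not for a directory or one
# of the file types this example site actually serves.
SetEnvIf Request_URI "^(?!.*(?:/|\.(?:html?|css|js|png|jpe?g|gif|ico|txt))$)" ban

# Ban requests for specific files you never want downloaded, e.g.:
SetEnvIf Request_URI "\.(?:bak|sql|inc)$" ban
```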

# Other Forbidden Header Attributes/Values

These are among the most important rules, and they will vary depending on your security requirements. For example, I block proxies, cookies with invalid names or containing duplicate names/values, along with many header fields and values exclusively used by web robots. And of course, as in the orangeask example above, invalid header attributes/values are banned. E.g.:

SetEnvIf Accept q=\. ban
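As further illustrations of this section (these specific rules are examples, not the author's; whether you can afford to ban proxy headers depends on your audience), headers that only proxies or certain bot libraries add can be banned simply by matching on their presence:

```apache
# Examples only: ban requests arriving through a disclosed proxy...
SetEnvIf Via .+ ban
SetEnvIf X-Forwarded-For .+ ban

# ...or carrying the TE token that some Perl/Python HTTP libraries
# advertise in the Connection header (see the bot discussion above).
SetEnvIf Connection "\bTE\b" ban
```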

# Allow Pass-thru Attributes:Values

These rules are mainly reserved for bots or certain proxies that I want to allow into my site. For example, to allow Googlebot into my site, I use:

SetEnvIf Remote_Addr ^(?:66\.249\.(?:6[4-9]|[78][0-9]|9[0-5])|74\.125)\. !ban pass bb_bsl_off=plugin:trap

The ban variable is unset by the !ban rule. I'll explain the pass variable later.

I also use this section to allow global access to certain files on my site. For example, to allow robots.txt, use:

SetEnvIf Request_URI ^/robots\.txt$ pass

The final part of my .htaccess file uses four blocks of mod_rewrite code. I'll share the first block.

# Forbidden Invalid Referrers
RewriteCond %{ENV:pass} !^1$
RewriteCond %{HTTP_REFERER} !^(?:https?://[a-z0-9\-.]{4,253}(?::\d+)?(?:/.*)?)?$
RewriteRule .* /cgi-bin/banbots.pl?403 [L]

With this example you can see how the pass variable is intended to operate.
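The block that actually enforces the ban variable isn't shown, but following the same pattern, a minimal sketch might look like this (using a plain [F] forbidden response in place of the author's banbots.pl handler):

```apache
# Sketch: deny any request that a SetEnvIf rule has marked with "ban",
# unless a pass-thru rule has set "pass".
RewriteCond %{ENV:pass} !^1$
RewriteCond %{ENV:ban} ^1$
RewriteRule .* - [F]
```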

The remaining blocks of code are somewhat complicated, but I will try to explain them as well as I can. Most of them require that each user-agent send the proper headers for each file request. This isn't a problem for real web browsers but bots choke on these rules all the time.

My sites also require the visitor to accept a simple cookie. Cookie-ignoring bots that hit the site using a local referrer on the first or subsequent requests will be banned. If the user-agent accepts the cookie, a plugin activates and a JavaScript sniffer file is sent to the browser.
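A rough sketch of that cookie check in mod_rewrite terms might be the following, where the cookie name "visited" and the example.com hostname are placeholders for illustration, not the author's actual values:

```apache
# Sketch: a visitor arriving from an internal page (local referrer)
# should already hold the site cookie; a cookie-ignoring bot will not.
# "visited" and example.com are hypothetical placeholders.
RewriteCond %{ENV:pass} !^1$
RewriteCond %{HTTP_REFERER} ^https?://(?:www\.)?example\.com/ [NC]
RewriteCond %{HTTP_COOKIE} !(?:^|;\s*)visited=
RewriteRule .* /cgi-bin/banbots.pl?403 [L]
```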

Well, I hope this information is of use to you. If you have any questions, I'll try my best to answer them. I'm happy to report that my .htaccess file is 5.2k, with less than 3.5k being used for access security. Not one IP banned. Not one user-agent banned. My site loads so much faster now. More importantly, I no longer have to tend to a growing and outdated list of banned IPs and user-agents.
8:07 pm on Feb 24, 2012 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month

joined:Aug 29, 2006
votes: 0

I hope this information is of use to you

Fascinating read, Key_Master, thank you.

I can't help thinking, though, that if a lot of webmasters did this stuff then it would not be very long before the botrunners upped their game - it is an arms race, after a fashion.

I'm happy to report that my .htaccess file is 5.2k

Impressive indeed.

