
Apache Web Server Forum

    
Creating a close to perfect .htaccess ban list
...the right way...
Key_Master
7:20 pm on Feb 24, 2012 (gmt 0)

Two years ago I embarked on a journey that would eventually lead me to the Holy Grail of .htaccess access-based security. Frustrated with what seemed to be a never-ending list of banned IPs and user-agents, I realized that I needed a better solution. Could I achieve the same or greater protection without banning a single IP? Would this even be possible?

I knew that attempting this would be a huge challenge. It would require hundreds of man-hours devoted to browser sniffing, software coding, and research. In the end, I had no way of knowing whether any of this hard work would even result in a practical solution. Nonetheless, I decided to go ahead and try.

The ground rules I laid out for myself were quite simple. The first rule: I couldn't ban a single IP or IP range. I also couldn't ban any portion of a user-agent. I would have to take a leap of faith and give all bots free rein on my site until I could devise an agreeable way to ban each one of them. And once completed, the Apache rules could not require more than 10k of code.

Now, most of us can tell the average bot from a web browser at first glance. Craftier bots require a little more investigation, but once you dig into the header fields they send (or, just as importantly, the ones they don't send), you can generally weed the good hits out from the bad. Header fields and their order are very important too. Web browsers send different header fields for different types of file requests, and some browsers send unique header fields or values that set them apart from other browsers. For example, Firefox often sends a Keep-Alive header value of 115, whereas IE doesn't. Mobile browsers send unique headers as well. With patience and a lot of testing, you will learn to identify the type of browser being used without ever having to look at the user-agent.

You'll also need to carefully examine each hit originating from web robots. These bots often send unique headers depending on the language they were written in. A Python or Perl bot might add a TE value in the Connection header. Likewise, PHP-based bots often send a 300 value in a Keep-Alive header. Sometimes they send header values that do not conform to the standards. For example, most of you are probably familiar with the orangeask referrer bot. What you probably didn't know is that it sends an invalid Accept header value that begins with q=. Notice the missing [0-1] preceding the dot? This error was so unique that I was able to connect this bot to two others, including the original search engine bot that the spammy orangeask software is using. By banning this one malformed value, I blocked at least three separate bots right off the bat and didn't have to resort to banning any IPs, user-agents, or referrers.
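Just to give a flavour of how a signature like that eventually turns into a rule (these are illustrative values based on the examples above, not lines from my actual file; verify them against your own header logs before using them):

# Illustrative only: header values typical of bot libraries rather than browsers
SetEnvIfNoCase Connection (^|[,\s])TE([,\s]|$) ban
SetEnvIf Keep-Alive ^300$ ban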

For obvious reasons, I cannot share my .htaccess file with you, but I can share the methods I devised to create it and explain its basic structure. The most important thing you will need is the ability to log all of the header fields sent by a user-agent; the standard Apache access logs will not suffice. I also recommend a JavaScript sniffer script to log any available values and compare them to the header fields later. You should also familiarize yourself with the Hypertext Transfer Protocol -- HTTP/1.1 [w3.org]. And, as I explained above, take the time to understand how different brands of browsers commonly interact with files on the web server.
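If your setup can't log headers yet, mod_log_config can capture the extra fields. Note that LogFormat and CustomLog belong in the server or virtual host configuration, not in .htaccess, and the field list below is only a starting point:

# Logs the header fields discussed above alongside the basic request data
LogFormat "%h %t \"%r\" %>s \"%{User-Agent}i\" \"%{Accept}i\" \"%{Accept-Language}i\" \"%{Accept-Encoding}i\" \"%{Connection}i\" \"%{Keep-Alive}i\" \"%{Referer}i\"" headers
CustomLog logs/headers_log headers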

The basic structure of my .htaccess file begins with the following blocks of rules. These rules use the SetEnvIf directive to flag requests with the specified attributes/values by setting the ban variable.

# Allow User-Agents
This section uses a rather complex white list rule that only allows modern Mozilla or Opera user-agents.
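The actual rule is longer than I can post here, but the general shape is a default ban that gets cleared only for user-agent strings a current browser would send. A simplified sketch (the patterns are examples only and will need tuning):

# Sketch only: flag everything, then clear the flag for modern Mozilla- and Opera-style agents
SetEnvIf User-Agent .+ ban
SetEnvIf User-Agent ^Mozilla/[45]\.0 !ban
SetEnvIf User-Agent ^Opera/9\. !ban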


# Allowed File Extensions
What file types does your site use? Allow only those requests. If necessary, ban any requests for specific files you don't want downloaded.
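As a simplified illustration (the extension list here is just an example; substitute whatever your site actually serves):

# Flag any request for an extension this site does not serve
SetEnvIfNoCase Request_URI \.(?!(html?|css|js|png|jpe?g|gif|ico|xml|txt)$)[a-z0-9]+$ ban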


# Other Forbidden Header Attributes/Values
These are among the most important rules, and they will vary depending on your security requirements. For example, I block proxies, cookies with invalid names or duplicate name/value pairs, and many header fields and values used exclusively by web robots. And of course, as in the orangeask example above, invalid header attributes/values are banned. E.g.:

SetEnvIf Accept q=\. ban
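A few more rules in the same vein, again as sketches rather than my real list (be sure a rule won't lock out legitimate visitors, such as users behind corporate proxies, before you deploy it):

# Requests routed through proxies usually announce themselves
SetEnvIf Via .+ ban
SetEnvIf X-Forwarded-For .+ ban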


# Allow Pass-thru Attributes:Values
These rules are mainly reserved for bots or certain proxies that I want to allow into my site. For example, to allow Googlebot into my site, I use:

SetEnvIf Remote_Addr ^(?:66\.249\.(?:6[4-9]|[78][0-9]|9[0-5])|74\.125)\. !ban pass bb_bsl_off=plugin:trap

The ban variable is unset by the !ban rule. I'll explain the pass variable later.

I also use this section to allow global access to certain files on my site. For example, to allow robots.txt, use:

SetEnvIf Request_URI ^/robots\.txt$ pass



The final part of my .htaccess file uses four blocks of mod_rewrite code. I'll share the first block.

# Forbidden Invalid Referrers
RewriteCond %{ENV:pass} !^1$
RewriteCond %{HTTP_REFERER} !^(?:https?://[a-z0-9\-.]{4,253}(?::\d+)?(?:/.*)?)?$
RewriteRule .* /cgi-bin/banbots.pl?403 [L]


With this example you can see how the pass variable is intended to operate.
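For completeness, the block that actually enforces the ban variable set earlier looks roughly like this. It's a sketch of the idea rather than my exact code:

# Enforce the ban flag unless a pass rule let the request through
RewriteCond %{ENV:pass} !^1$
RewriteCond %{ENV:ban} ^1$
RewriteRule .* /cgi-bin/banbots.pl?403 [L]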


The remaining blocks of code are somewhat complicated, but I will try to explain them as well as I can. Most of them require that each user-agent send the proper headers for each file request. This isn't a problem for real web browsers, but bots choke on these rules all the time.
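One of the simpler checks of that kind, shown only as an illustration of the idea: every mainstream browser sends an Accept header with its requests, so a request that arrives without one can be handed straight to the ban script.

# Illustration: no Accept header at all means this is not a web browser
RewriteCond %{ENV:pass} !^1$
RewriteCond %{HTTP_ACCEPT} ^$
RewriteRule .* /cgi-bin/banbots.pl?403 [L]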

My sites also require the visitor to accept a simple cookie. Cookie-ignoring bots that hit the site using a local referrer on the first or subsequent requests will be banned. If the user-agent accepts the cookie, a plugin is activated and a JavaScript sniffer file is sent to the browser.
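A sketch of that referrer/cookie check (the hostname and cookie name below are placeholders, and the real version works together with the plugin mentioned above):

# Sketch: a local referrer with no session cookie (placeholder name) gets banned
RewriteCond %{ENV:pass} !^1$
RewriteCond %{HTTP_REFERER} ^https?://(www\.)?example\.com/ [NC]
RewriteCond %{HTTP_COOKIE} !(^|;\s*)site_visit= [NC]
RewriteRule .* /cgi-bin/banbots.pl?403 [L]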

Well I hope this information is of use to you. If you have any questions, I'll try my best to answer them. I'm happy to report that my .htaccess file is 5.2k, with less than 3.5k being used for access security. Not one IP banned. Not one user-agent banned. My site loads so much faster now. More importantly, I no longer have to tend to a growing and outdated list of banned IPs and user-agents.

 

Samizdata
8:07 pm on Feb 24, 2012 (gmt 0)

"I hope this information is of use to you"

Fascinating read, Key_Master, thank you.

I can't help thinking, though, that if a lot of webmasters did this stuff then it would not be very long before the botrunners upped their game - it is an arms race, after a fashion.

"I'm happy to report that my .htaccess file is 5.2k"

Impressive indeed.

...
