Forum Moderators: phranque
I'm currently working on a bot-trap type of project incorporating a MySQL database and a type of admin panel which will allow the user (web programmer) to add or remove entries in the .htaccess file without needing much knowledge of .htaccess.
I'm aware of the potential for disaster inside an .htaccess file, but I'm sure that if the code is written correctly there shouldn't be any worries. Extensive testing, for sure.
My question is about the user agent banning. Currently the bot trap bans IP addresses (Deny from X) when an IP does something bad (ignoring robots.txt, for example). That IP deny gets written to .htaccess and stored in a database of banned IPs. What I would like to do is allow the user to also ban the User-Agent associated with that IP by selecting the log entry from a table. The program would then get the UA from the database, wrap the appropriate code around it, and put it in the .htaccess file at the proper location. The User-Agent is retrieved from the server as a string containing the entire UA.
There's no way, at least none that I know of, for a program to whittle a UA down to the important bit (i.e. HTTrack) from the whole UA string. It's something a human needs to evaluate. That being the case, the only option I would have is to include the entire UA string in the .htaccess file, like:
RewriteCond %{HTTP_USER_AGENT} ^The sometimes incredibly long user agent string goes here$
Is there anything wrong with that? Does it violate a rule anywhere... will it cause any problems... it is just generally a bad idea... or will it be ok?
Thanks for any responses!
One thing that can really help keep things simple is to use the SetEnvIf directive to set a variable (commonly called "getout" in posts here) and then test that variable later using a single "Deny from" or RewriteCond.
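A minimal sketch of that pattern, assuming hypothetical example entries (the variable name "getout" follows the convention mentioned above; the UA and IP values are placeholders):

```apache
# Prepended records -- each one just sets the "getout" variable
SetEnvIfNoCase User-Agent "HTTrack" getout
SetEnvIf Remote_Addr "^192\.0\.2\.10$" getout

# Static block at the bottom of the file -- tests the variable once
Order Allow,Deny
Allow from all
Deny from env=getout
```

New bans only ever add one SetEnvIf line at the top; the Deny block never changes.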
The advantage is that you simply prepend records to the .htaccess file, which saves you the trouble of having to read it in a line at a time, parse it, and find the right "insertion point." Remember, you're going to need to flock() the .htaccess file to prevent two or more concurrent threads from trying to 'edit' it simultaneously. If you don't flock the file, then the last thread to write to it 'wins' and the other threads' entries will be lost. So, simply prepending new records is both simple and fast, and requires the file to be locked for the shortest possible time.
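A rough PHP sketch of the prepend-under-lock idea (the function name and rule format are placeholders, not from any particular script):

```php
<?php
// Prepend one new record to .htaccess while holding an exclusive lock,
// so concurrent requests can't clobber each other's entries.
function prepend_rule($htaccess, $rule)
{
    $fp = fopen($htaccess, 'c+');            // read/write, create if missing
    if ($fp === false) {
        return false;
    }
    if (!flock($fp, LOCK_EX)) {              // block until we hold the lock
        fclose($fp);
        return false;
    }
    $existing = stream_get_contents($fp);    // read the current contents
    rewind($fp);
    ftruncate($fp, 0);
    fwrite($fp, $rule . "\n" . $existing);   // new record goes first
    fflush($fp);
    flock($fp, LOCK_UN);                     // release as soon as possible
    fclose($fp);
    return true;
}
```

Since the whole file is rewritten in one short critical section, the lock is held only for the duration of a single read-and-write.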
Take a look at the various versions of key_master's bad-bot Perl script, and xlcus'/alexk's runaway 'bot PHP script -- they're sure to give you some ideas...
As to the UA string, sure, you can put the whole thing in there -- just be sure to escape all 'special' regex characters, or put the whole string in quotes.
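Since the admin panel generates the .htaccess line programmatically, PHP's preg_quote() can do the escaping. A sketch, with a hypothetical helper name:

```php
<?php
// Turn a raw User-Agent string into a RewriteCond line safe for .htaccess.
// preg_quote() escapes regex metacharacters like . ( ) + ? | and $;
// spaces are escaped separately, since an unescaped space would end
// the pattern argument in the Apache config file.
function ua_to_rewritecond($ua)
{
    $pattern = preg_quote($ua);
    $pattern = str_replace(' ', '\\ ', $pattern);
    return 'RewriteCond %{HTTP_USER_AGENT} ^' . $pattern . '$';
}
```

Alternatively, Apache accepts the pattern argument in double quotes, in which case only the regex metacharacters (not the spaces) need escaping.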
Jim
[edited by: jdMorgan at 11:41 pm (utc) on April 30, 2007]
I'll check into that SetEnvIf information. I've got a version of a bot-trap that I'm using as a rough model, and I have the flock and the insertion already working. In PHP my find-point-then-insert code is only one command long, so it shouldn't be too bad as far as access time. I started out only adding lines to the end of .htaccess, but it got messy-looking pretty fast... so I wanted to try to keep it as clean as possible.
Thanks for the heads up though. If I didn't have my base bot-trap script to work from I would have never thought to lock the file.