Forum Moderators: phranque
After spending a fair number of hours reading [webmasterworld.com...] and all its variants, and performing searches on WebmasterWorld and Google, I am completely confused. A number of people have posted excellent scripts, good .htaccess files and various Perl and PHP bot traps. I am quite new to .htaccess, and at the moment my .htaccess only contains basic information, like denying directory access, preventing .htaccess viewing and some redirects.
After reading the threads mentioned, I am confused as to what approach I should take. I understand the threat posed by bad user agents, site copiers and the like, but I am unsure how I should go about doing something about them.
I found this useful code in one of the .htaccess threads:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} BlackWidow|Bot\ mailto:craftbot@yahoo.com|ChinaClaw|DISCo|Download\ Demon|eCatch|EirGrabber|EmailSiphon|Express\ WebPictures|ExtractorPro|EyeNetIE|FlashGet|GetRight|Go!Zilla|Go-Ahead-Got-It|GrabNet|Grafula|HMView|HTTrack|Image\ Stripper|Image\ Sucker|InterGET|Internet\ Ninja|JetCar|JOC\ Web\ Spider|larbin|LeechFTP|Mass\ Downloader|MIDown\ tool|Mister\ PiX|Navroad|NearSite|NetAnts|NetSpider|Net\ Vampire|NetZIP|Octopus|Offline\ Explorer|Offline\ Navigator|PageGrabber|Papa\ Foto|pcBrowser|RealDownload|ReGet|Siphon|SiteSnagger|SmartDownload|SuperBot|SuperHTTP|Surfbot|tAkeOut|Teleport\ Pro|VoidEYE|Web\ Image\ Collector|Web\ Sucker|WebAuto|WebCopier|WebFetch|WebReaper|WebSauger|Website\ eXtractor|WebStripper|WebWhacker|WebZIP|Wget|Widow|Xaldon\ WebSpider|Zeus
RewriteRule .* - [F,L]
Obviously you would have to double-check to make sure there was nothing in there you would not want to block, but would this do a fair job of blocking bots? I am amazed at the length of some of the .htaccess files I see - do they need to be that long to stop all the nasty files/bots?
Then there is the code found here [webmasterworld.com]. Should I use this instead (the server hosting my files does run PHP)?
Then there is this [diveintomark.org...] that uses something like 50 RewriteRules - surely this would make your site really slow?
I guess what I am trying to find out is the best way to block all the nasty files/tools/UAs that are out there. Nothing is blocked at the moment, so I guess anything would be an improvement. If you go down the .htaccess route, surely you would have to update it every time something new and horrible comes out?
Choices, choices..argh! Can anyone help?
[edited by: jdMorgan at 1:40 am (utc) on Nov. 28, 2003]
[edit reason] Corrected link syntax [/edit]
The answer to your question depends on several factors, for example:
The busier your site is, the less CPU time you'll have to run .htaccess user-agent tests, bad-bot traps, and bandwidth limiters.
But if your site has lots of plain-text e-mail addresses, and your content is intrinsically valuable to a site downloader, then you have better reason to guard it well.
Some methods immediately available are:
You can use the password-protection mechanism built into Apache to protect sensitive data. You can screen for User-agent, screen for troublesome IP addresses and IP ranges, and detect requests for certain files that signal various types of attacks and exploit attempts, all using mod_access and mod_rewrite in .htaccess. You can install a bad-bot trap to catch those robots and other automated agents which do not honor robots.txt. And you can install a bandwidth-limiting script to limit requests from a given IP address per unit time.
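To make the .htaccess side of that concrete, here's a minimal sketch. The directives themselves are standard Apache (mod_access, mod_rewrite, and the built-in basic authentication), but every IP range, user-agent, filename and path below is a made-up example, not a recommendation:

# Block a troublesome IP range with mod_access (192.0.2. is a placeholder):
Order Allow,Deny
Allow from all
Deny from 192.0.2.

# Screen by User-agent, and refuse requests for files that only exploit
# scanners ever ask for (both patterns here are examples):
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [NC,OR]
RewriteCond %{REQUEST_URI} (cmd\.exe|root\.exe|default\.ida) [NC]
RewriteRule .* - [F]

# Password-protect a sensitive directory -- this part goes in that
# directory's own .htaccess, and the .htpasswd path is an assumption:
AuthType Basic
AuthName "Private"
AuthUserFile /full/path/to/.htpasswd
Require valid-user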
I use all but the last method myself. Some here will advise varying degrees of enforcement, but a pragmatic approach might be to go through your site statistics for the previous year. Note all of the troublesome user-agents that you find hitting your site often, and include those in .htaccess using the methods you've found posted here. If you have a very busy site with lots of tempting content, you may need to block all of the user-agents you've seen in those lists pre-emptively, but it probably won't be necessary. At some point you reach diminishing returns; every user-agent you add to the list slows down your server incrementally, and some of those troublemakers only hit your site a few times a year, so they might be better handled on a per-attempt basis. This is a trade-off that only you can make.
In order to cover the User-agents you don't have in a User-agent ban list, and to catch those that no one knows about yet, install key_master's bad-bot script or one of the variants posted here. This script catches bots that ignore robots.txt and writes their IP addresses into your .htaccess file to ban them. You can also have it log the user-agent and any other info you might find useful. If you see a new troublemaker User-agent appearing often, it might be better to remove those IP-address blocks from .htaccess and ban by User-agent instead. Sometimes you'll notice multiple attempts from within the same IP address range. When that happens, it may be worthwhile to research that IP range, see who owns it, and combine the individual related IP blocks into a single one that covers the entire range owned by that entity.
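The trap itself is a script, but the supporting pieces live in robots.txt and .htaccess. Roughly, the parts fit together like this; the /trap/ path and the IP addresses are placeholders, and the exact details depend on which variant of the script you install:

# robots.txt -- compliant robots will never request anything under /trap/:
#   User-agent: *
#   Disallow: /trap/
#
# A link hidden from human visitors points into /trap/; any agent that
# follows it has ignored robots.txt, and the trap script appends a line
# like this to .htaccess:
Order Allow,Deny
Allow from all
Deny from 203.0.113.45
# Related entries can later be combined by hand into a single range:
Deny from 203.0.113.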
There are several ways to limit bandwidth: the PHP script you found, the mod_throttle module for Apache, and other similar programs and scripts. These can also be used as additional components of site defense.
There are a few ways you can limit the performance impact of all this on your server. First, combine multiple rewrite directives into one using the OR'ing technique used in the code you posted, rather than having an individual mod_rewrite directive for each and every user-agent and IP address.
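For instance, rather than one RewriteCond per agent (the user-agents here are taken from the list you posted):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^HTTrack [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget
RewriteRule .* - [F]

...fold the patterns into a single alternation, exactly as your posted code does:

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^(WebZIP|HTTrack|Wget)
RewriteRule .* - [F]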
Secondly, if possible, move the really bad and consistent troublemaker IP and User-agent blocks from .htaccess to httpd.conf. The mod_rewrite code in httpd.conf is compiled whenever the server is restarted, and as a result it runs much faster than the same code in .htaccess would, because .htaccess code is re-read and interpreted on each HTTP request.
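As a sketch, the same ban would look like this in httpd.conf; where exactly it goes (the main server config or a <VirtualHost> section) depends on your server layout, and the user-agents are again examples:

# In httpd.conf -- compiled at server start-up, not on every request:
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^(WebZIP|HTTrack|Wget)
RewriteRule .* - [F]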
Thirdly, look to 'exporting' some of these access controls to router or firewall hardware if possible. Blocking requests from known-bad IP addresses stops those requests from ever reaching your server, and this is the most efficient on-site solution possible.
Lastly, where a 'software' approach is used, take care to code efficiently. Combine lines as mentioned above, use start- and end-anchoring appropriately to speed up regular-expression pattern matching, and look at partitioning your files into directories based on the amount of protection needed. Individual .htaccess files per directory can be used to provide strong protection where needed, while leaving higher-level directories and other peer-level directories with less-but-adequate protection optimized more for speed.
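On the anchoring point, here is a small illustration; "Wget" is just a stand-in for whichever agent you are matching:

RewriteEngine On
# Unanchored -- the regex engine may attempt the match at every position
# in the User-agent string:
#   RewriteCond %{HTTP_USER_AGENT} Wget
# Start-anchored -- the match fails immediately unless the string begins
# with "Wget", which is cheaper:
RewriteCond %{HTTP_USER_AGENT} ^Wget
RewriteRule .* - [F]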
As for the server performance impact, that would depend on your server load and how you implement these blocks. Many factors come into play, including network connection bandwidth, disk access speed, local (server) caching, available memory, and CPU/memory speed. In most cases, all but the last two factors will swamp any additional delays caused by blocking-code execution. Remember that both httpd.conf and .htaccess code are native to Apache, and so execute pretty darn fast compared to external scripts.
Well, there are some things to think about, and more than a few argument-, er, discussion-starters, too, I'll bet...
Jim
Yes, I would say that my site is busy. While I would not say the content of the site is valuable or particularly useful to anyone not interested in the subject area, I would still like to protect it from site grabbers. The content has taken a long time to research, compile and structure, and I wouldn't like to think someone with a site grabber could steal or replicate my site in next to no time at all.
I think I looked at key_master's post about the bad-bot traps; I will have to give it another look and try to get it up and running. I'm not the best at Perl, but it doesn't look too bad, so I will give it a try.
I don't think I have access to httpd.conf, as I have space with a hosting company on one of their virtual servers. I only realised a short time ago what .htaccess was, that I could use it, and what its benefits are.
Thanks again, Jim, for taking the time to help me with this issue; it really is much appreciated.