Gorufu, littleman, Air, SugarKane? You guys see any errors or better ways to do this....anybody got a bot to add....before I stick this in every site I manage.
Feel free to use this on your own site and start blocking bots too.
(the top part is left out)
<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org$
RewriteRule .* - [F]
I've left part 2 as is because I need to exclude shtml and shtm and this was the only combination I could get to work at the time...
I've changed the last bit of Part 3 as you suggested.
Just 3 quick questions if I can, what does the
RewriteRule !^403.htm$ - [F,L] (in particular the F,L bit) mean at the end?
Is it better to use absolute or relative paths for the error documents?
How can I check that it's working once I've uploaded it?
Thanks again
Anni
RewriteRule !^403.htm$ - [F,L] (in particular the F,L bit) mean at the end?
RewriteRule- The conclusion of all those conditions, and the terminator of that block. It means that if any (*because you use [OR]) of those conditions is true, do this.
!- this means not
^- means the request string BEGINS EXACTLY here
403.htm- name of your 403 file
$- ends exactly (the above will not match 403.html!)
- (the lone dash) means no substitution: leave the requested URL as it is
F means this is forbidden- return FORBIDDEN (403)
L means LAST, do nothing further and end all rewrite rules for any requests affected by this block
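Putting the pieces together, here is a rough annotated sketch of such a block. The two conditions are just samples taken from the list at the top of the thread, and I've escaped the dot in 403\.htm so it only matches a literal dot; treat it as an illustration rather than a drop-in file.

```apache
RewriteEngine On
# Two sample conditions; [OR] means either one is enough
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf
# !^403\.htm$  : apply to every request except 403.htm itself
# -            : no substitution, leave the URL as requested
# [F]          : answer with 403 Forbidden
# [L]          : stop processing rewrite rules for this request
RewriteRule !^403\.htm$ - [F,L]
```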
Is it better to use absolute or relative paths for the error documents?
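for the first part of the RewriteRule (the pattern), use a path from the directory the .htaccess file sits in, which is the root directory here, without the leading slash
for the second part (the substitution), you HAVE to use a full URL (http://www.domain.com/) if you want to redirect the visitor to another address; a "-" like the one above needs nothing more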
How can I check that it's working once I've uploaded it?
I would recommend going to [wannabrowser.com...]
and spoofing your UA!
Good Luck!
dave
I've left part 2 as is because I need to exclude shtml and shtm and this was the only combination I could get to work at the time...
The change I suggested will do the same. You just had a somewhat complicated and inefficient regex pattern saying, "match anything that ends with htm or html". The two methods are equivalent, except that the original method would match htm, html, htmll, htmlll, or htmlllllllllllllllllllll, etc.
The new method will match anything that ends with htm or html only.
If you want to exclude shtm and shtml files from the match, use <FilesMatch "\.html?$"> which will require the path to end with ".htm" or ".html".
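As a small sketch of how that container would look in the file (the body is only a placeholder, since the directives from part 2 are not shown here):

```apache
# Matches index.htm and page.html, but not page.shtml or page.shtm,
# because the literal dot must come directly before "htm"
<FilesMatch "\.html?$">
    # Hypothetical placeholder: put the directives meant only for
    # plain .htm/.html files here
</FilesMatch>
```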
Just 3 quick questions if I can, what does the
RewriteRule !^403.htm$ - [F,L] (in particular the F,L bit) mean at the end?
If the conditions match:
For any requested URL except 403.htm, leave the URL unchanged (the "-" means no substitution), return a Forbidden server status code, and stop processing rewrite rules; this is the Last one to process. The result is that any banned User-Agent will receive a 403-Forbidden server response, and the server will hand it your custom 403 error page, 403.htm (which is why you don't want to forbid that URL when the server fetches it). Most bad-bots will ignore the error page, but that's OK.
Is it better to use absolute or relative paths for the error documents?
ErrorDocument paths must be local (a URL-path starting with a slash, such as /403.htm, rather than a full http:// URL), otherwise a 302-Moved Temporarily response code will be sent to the requesting client, masking the correct error code.
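For illustration, a minimal sketch of the two forms. The file name /404.htm and the host www.your-site.com are assumptions for the example; substitute your own.

```apache
# Local URL-paths: the server serves the page and keeps the real status code
ErrorDocument 403 /403.htm
ErrorDocument 404 /404.htm

# Avoid: a full URL makes the server send a redirect to the client instead
# ErrorDocument 403 http://www.your-site.com/403.htm
```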
How can I check that it's working once I've uploaded it?
That's tricky... The first thing to check is whether you can still access your web site. Various errors in .htaccess can result in a 500-Server Error code being returned, and your site will be inaccessible. Be ready to remove your new .htaccess and replace it with a known-good backup if this happens! Then view your server error log to find out what caused the server error.
The next part can be done several ways. Checking to see whether your User-agent blocks can be accomplished by modifying your registry entries for Internet Explorer to make it send a blocked User-agent string. Do this only if you are familiar with registry backups and editing! Otherwise, you can simply check your log files once in a while to confirm that bad bots are being blocked as expected.
Testing the custom 404 error document is easy, just request a non-existent page from your site. Testing the 500-series codes is more difficult, since you will need to create redirects for several non-existent files and then request those files in order to test the custom handlers:
RewriteRule ^test501.htm$ - [R=501,L]
RewriteRule ^test502.htm$ - [R=502,L]
etc.
Also, unless you are handling password logins with a custom script, I suggest that you do not redirect 401s to a custom error document.
Again, spending some time reviewing the Apache server documentation [httpd.apache.org] will clear up many questions. I print it out once a year, or when my current copy is worn out, whichever comes first! :)
Jim
[edited by: jdMorgan at 12:59 am (utc) on Oct. 5, 2002]
.htaccess rocks, and there are many other things I use it for. Here is a great one for preventing people from hotlinking your files:
RewriteEngine On
RewriteCond %{HTTP_REFERER} !^http://([a-z0-9-]+\.)*yourdomain\.com/ [NC]
RewriteCond %{HTTP_REFERER} !^http://([a-z0-9-]+\.)*12\.345\.67\.890/ [NC]
RewriteRule .* http://yourdomain.com/ [L,R]
Obviously "yourdomain.com" would be your domain name, and the "12.345.67.890" would be your site's IP address.
For example, I use this one in my "images" directory to prevent people from hotlinking my images on their sites. I also have it in my "logs" folder so they can't view my site logs.
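One possible variation for an images directory, offered as a sketch and not as the setup described above: it lets through requests that send no Referer at all, only acts on common image extensions, and answers with a plain 403 instead of a redirect. The extension list and the choice of [F] are my assumptions.

```apache
RewriteEngine On
# Let through requests that send no Referer at all (assumption: you want
# direct visits and privacy proxies to keep working)
RewriteCond %{HTTP_REFERER} !^$
# Let through requests referred from your own domain or any subdomain
RewriteCond %{HTTP_REFERER} !^http://([a-z0-9-]+\.)*yourdomain\.com/ [NC]
# Everything else asking for an image gets 403 Forbidden
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]
```

A 403 costs the server less than redirecting every hotlinked image to the home page, but the redirect works too if you prefer it.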
-Superman-
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^BlackWidow [OR]
RewriteCond %{HTTP_USER_AGENT} ^Bot\ mailto:craftbot@yahoo.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ChinaClaw [OR]
RewriteCond %{HTTP_USER_AGENT} ^Custo [OR]
RewriteCond %{HTTP_USER_AGENT} ^DISCo [OR]
RewriteCond %{HTTP_USER_AGENT} ^Download\ Demon [OR]
RewriteCond %{HTTP_USER_AGENT} ^eCatch [OR]
RewriteCond %{HTTP_USER_AGENT} ^EirGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^EyeNetIE [OR]
RewriteCond %{HTTP_USER_AGENT} ^FlashGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetRight [OR]
RewriteCond %{HTTP_USER_AGENT} ^GetWeb! [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go!Zilla [OR]
RewriteCond %{HTTP_USER_AGENT} ^Go-Ahead-Got-It [OR]
RewriteCond %{HTTP_USER_AGENT} ^GrabNet [OR]
RewriteCond %{HTTP_USER_AGENT} ^Grafula [OR]
RewriteCond %{HTTP_USER_AGENT} ^HMView [OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Stripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^Image\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^InterGET [OR]
RewriteCond %{HTTP_USER_AGENT} ^Internet\ Ninja [OR]
RewriteCond %{HTTP_USER_AGENT} ^JetCar [OR]
RewriteCond %{HTTP_USER_AGENT} ^JOC\ Web\ Spider [OR]
RewriteCond %{HTTP_USER_AGENT} ^larbin [OR]
RewriteCond %{HTTP_USER_AGENT} ^LeechFTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mass\ Downloader [OR]
RewriteCond %{HTTP_USER_AGENT} ^MIDown\ tool [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mister\ PiX [OR]
RewriteCond %{HTTP_USER_AGENT} ^Navroad [OR]
RewriteCond %{HTTP_USER_AGENT} ^NearSite [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetAnts [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Net\ Vampire [OR]
RewriteCond %{HTTP_USER_AGENT} ^NetZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Octopus [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Explorer [OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ Navigator [OR]
RewriteCond %{HTTP_USER_AGENT} ^PageGrabber [OR]
RewriteCond %{HTTP_USER_AGENT} ^Papa\ Foto [OR]
RewriteCond %{HTTP_USER_AGENT} ^pavuk [OR]
RewriteCond %{HTTP_USER_AGENT} ^pcBrowser [OR]
RewriteCond %{HTTP_USER_AGENT} ^RealDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^ReGet [OR]
RewriteCond %{HTTP_USER_AGENT} ^SiteSnagger [OR]
RewriteCond %{HTTP_USER_AGENT} ^SmartDownload [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperBot [OR]
RewriteCond %{HTTP_USER_AGENT} ^SuperHTTP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Surfbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^tAkeOut [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ Pro [OR]
RewriteCond %{HTTP_USER_AGENT} ^VoidEYE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Image\ Collector [OR]
RewriteCond %{HTTP_USER_AGENT} ^Web\ Sucker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebAuto [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebFetch [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebGo\ IS [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebLeacher [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebReaper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebSauger [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ eXtractor [OR]
RewriteCond %{HTTP_USER_AGENT} ^Website\ Quester [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebStripper [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebWhacker [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^Widow [OR]
RewriteCond %{HTTP_USER_AGENT} ^WWWOFFLE [OR]
RewriteCond %{HTTP_USER_AGENT} ^Xaldon\ WebSpider [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus
RewriteRule ^.* - [F,L]
Some notes:
1. The [1] at the beginning and the [2] at the end of my original file were added sometime after this site's format changed (something to do with the formatting codes). Anyway, they did not belong there.
2. All the things in my list have been thoroughly researched. 90 percent of them are Site Downloaders. There are also some email harvesters and other evil things (like VoidEye).
3. If you know a bot respects robots.txt, put it there. It will shorten your list (see my post above). If anybody sees something in my list that definitely obeys robots.txt, please let me know.
4. Adding the [NC,OR] to all of your entries will only make your file that much bigger. 99 percent of these things always use the exact useragent name. If there are anomalies (like httrack), then by all means make it case-insensitive. Same with the ^ character. They always start the same way.
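If file size is the worry, one option (a sketch only, using a handful of names from the list above) is to group the exact-prefix agents into a single alternation and keep the case-insensitive, unanchored test for the anomalies:

```apache
RewriteEngine On
# Alternation groups several exact-prefix agents into one condition
RewriteCond %{HTTP_USER_AGENT} ^(EmailSiphon|EmailWolf|ExtractorPro|WebZIP|Wget|Zeus) [OR]
# Known anomalies stay unanchored and case-insensitive
RewriteCond %{HTTP_USER_AGENT} (HTTrack|Indy\ Library) [NC]
RewriteRule .* - [F,L]
```

The one-agent-per-line layout is easier to maintain, so this is a trade-off and not a must.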
-Superman-