Gorufu, littleman, Air, SugarKane? Do you guys see any errors or better ways to do this? Anybody got a bot to add before I stick this in every site I manage?
Feel free to use this on your own site and start blocking bots too.
(the top part is left out)
<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
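# A RewriteRule pattern is tested against the URL path, not a full URL,
# so match any path here and let the referer condition do the filtering.
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org$
RewriteRule .* - [F]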
2. I've never had a problem with people ripping my site with it.
I don't really see how FrontPage can be used similarly to something like Teleport Pro to rip a site ... it seems like it would be a very slow way to do it.
Anyway, it's easy to block if you want. Just add this line:
RewriteCond %{HTTP_USER_AGENT} FrontPage [NC,OR]
-Superman-
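For anyone unsure how the anchored patterns (like ^Wget) differ from an unanchored, case-insensitive one (like FrontPage with [NC]), here is a rough offline check in Python. It is only a sketch: Python's re module stands in for Apache's regex engine, the CONDITIONS list is a hand-copied subset of the rules in this thread, is_blocked is a made-up helper name, and the sample user-agent strings are just illustrations.

```python
import re

# Each entry is (pattern, nocase); nocase=True stands in for the [NC] flag.
# Patterns are a hand-copied subset of the RewriteCond lines in this thread.
CONDITIONS = [
    (r"^Wget", False),
    (r"^Teleport", False),
    (r"^[Ww]eb[Bb]andit", False),
    (r"FrontPage", True),
]

def is_blocked(user_agent, conditions=CONDITIONS):
    """Return True if any pattern matches, like a chain of [OR] conditions."""
    for pattern, nocase in conditions:
        flags = re.IGNORECASE if nocase else 0
        # re.search honors the ^ anchor, so anchored patterns only match
        # at the very start of the user-agent string.
        if re.search(pattern, user_agent, flags):
            return True
    return False

if __name__ == "__main__":
    for ua in ("Wget/1.8.2",
               "wget/1.8.2",
               "Mozilla/2.0 (compatible; MS FrontPage 4.0)"):
        print(ua, "->", "blocked" if is_blocked(ua) else "allowed")
```

The point is that ^Wget only catches agents whose string starts with exactly "Wget", while the FrontPage line catches the word anywhere in the string, in any case.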
<FilesMatch "\.htm([l])*$">
ForceType application/x-httpd-php
</FilesMatch>
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
............
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteCond %{HTTP_REFERER} ^http://www.iaea.org
RewriteRule !^http://www.mydomain.com/403.htm$ - [F,L]
My 404 works fine; I didn't really understand how to do the 500 tests.
My FilesMatch works fine,
but the rewrite doesn't work. I actually downloaded a trial of WebZip to test it, and it allowed me to download my whole site.
I also used [wannabrowser.com...], typing in examples from the .htaccess file such as WebZIP or WebCopier, and it didn't show my 403.htm file.
Any ideas anyone?
Thanks
Anni
When I use wannabrowser.com I just type in HTTP User Agent: WebSauger
and location: http://www.mydomain.com
right?
Anyway, that's what I did to try to test it, but I get the HTML of my index page instead of my 403.htm page...
any other thoughts?
Anni
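If you'd rather test from your own machine than through a web form, a few lines of Python can send a request with whatever User-Agent you like and report the status code. This is a minimal sketch, not a polished tool: build_request and fetch_status are names made up for this example, and the URL is the same placeholder used above. A blocked agent should come back as 403; a 200 means the rule did not fire.

```python
import urllib.error
import urllib.request

def build_request(url, user_agent):
    """Build a GET request that announces itself with the given User-Agent."""
    return urllib.request.Request(url, headers={"User-Agent": user_agent})

def fetch_status(url, user_agent, timeout=10):
    """Return the HTTP status code the server sends for this User-Agent."""
    try:
        with urllib.request.urlopen(build_request(url, user_agent),
                                    timeout=timeout) as resp:
            return resp.status
    except urllib.error.HTTPError as err:
        # urlopen raises on 4xx/5xx; the code we want is on the exception.
        return err.code

if __name__ == "__main__":
    # Placeholder URL: substitute your own site before running.
    print(fetch_status("http://www.mydomain.com/", "WebSauger"))
```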
I'm glad you got it working too - I was away from WebmasterWorld today. But, as you found out, there's a lot of knowledge here among many members, and now you too are an experienced htaccess debugger!
carfac's comment on your rewrite rule above is correct: The rule's pattern (on the left) uses only a path, and the substitution (on the right) uses either a canonical URL (http://www.mydomain.com/substitutefile.html) or a path (/substitutefile.html). So as carfac said, the rule must read:
RewriteRule !^403.htm$ - [F,L]
mod_rewrite is like nitroglycerin - Very powerful, but don't drop it! One little typo can blow the whole thing up - a fact I was personally reminded of just two hours ago when I found I'd inadvertently introduced a syntax error into my spambot block RewriteRules, and essentially disabled the whole lot!
Jim
Welcome to WebmasterWorld!
We are discussing one aspect of htaccess here - banning bad user-agents by name. The other aspect (the one we are not addressing directly here) is banning by IP addresses, or by range of IP addresses.
Where banning by IP is concerned, two approaches are often used - blocking the IP address manually in htaccess, and a script-based approach used to trap bad bots.
In the script-based approach, you put a link to a "trap file" on one or more of your pages. Then you Disallow that "trap file" in robots.txt. This "trap file" doesn't really exist - accessing it in defiance of the robots.txt actually invokes the script, which then automatically adds an IP-address-based block to your htaccess file.
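To make the trap idea a bit more concrete, here is a rough Python sketch of the part of such a script that adds the block. It is an illustration under assumptions, not the script anyone here actually runs: block_line and add_block are invented names, the block is written as a "deny from" line (which assumes an "order allow,deny" / "allow from all" setup is already in the file), and in a real CGI script the address would come from the REMOTE_ADDR environment variable.

```python
import ipaddress
import os

def block_line(ip):
    """Format a deny line, rejecting anything that is not a valid IP address."""
    addr = ipaddress.ip_address(ip)  # raises ValueError on bad input
    return f"deny from {addr}"

def add_block(htaccess_path, ip):
    """Append a deny line for ip unless it is already there.

    Returns True if a line was added, False if the IP was already blocked.
    """
    line = block_line(ip)
    content = ""
    if os.path.exists(htaccess_path):
        with open(htaccess_path) as fh:
            content = fh.read()
    if line in (existing.strip() for existing in content.splitlines()):
        return False
    with open(htaccess_path, "a") as fh:
        # Make sure the new line does not get glued onto the previous one.
        if content and not content.endswith("\n"):
            fh.write("\n")
        fh.write(line + "\n")
    return True
```

Validating the address before writing matters here, since anything the script appends ends up in a file Apache parses on every request.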
To get you started blocking the more egregious abusers, here's a way to block an IP address manually in htaccess:
# Troublesome AT&T Broadband user
RewriteCond %{REMOTE_ADDR} ^65\.97\.14\.251$
RewriteRule .* - [F,L]
However, more often the real bad guys use entire ranges of addresses, which makes the blocking more complex. If the patterns below are not clear, you'll need to study up on the regular expressions used for pattern matching in mod_rewrite:
# Cyveillance
RewriteCond %{REMOTE_ADDR} ^63\.148\.99\.2(2[4-9]|[34][0-9]|5[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^63\.226\.3[34]\. [OR]
RewriteCond %{REMOTE_ADDR} ^63\.212\.171\.161$ [OR]
# Webcontent International
RewriteCond %{REMOTE_ADDR} ^65\.102\.12\.2(2[4-9]|3[01])$ [OR]
RewriteCond %{REMOTE_ADDR} ^65\.102\.17\.(3[2-9]|[4-6][0-9]|7[0-1]|8[89]|9[0-5]|10[4-9]|11[01])$ [OR]
RewriteCond %{REMOTE_ADDR} ^65\.102\.23\.1(5[2-9]|6[0-7])$
RewriteRule .* - [F,L]
Hope this helps,
Jim
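Range patterns like the ones above are easy to get wrong, so it can help to expand one and see exactly which addresses it covers. Here is a small Python sketch for that, with the alternation bar written as a plain "|". The helper names matching_last_octets and as_ranges are made up for this example, and Python's re module is only standing in for Apache's regex engine, which treats these simple patterns the same way as far as I know.

```python
import re

def matching_last_octets(pattern, prefix):
    """List every last octet (0-255) under prefix that the pattern matches."""
    rx = re.compile(pattern)
    return [n for n in range(256) if rx.search(f"{prefix}.{n}")]

def as_ranges(numbers):
    """Collapse a sorted list of ints into (start, end) tuples."""
    ranges = []
    for n in numbers:
        if ranges and n == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], n)
        else:
            ranges.append((n, n))
    return ranges

if __name__ == "__main__":
    cyveillance = r"^63\.148\.99\.2(2[4-9]|[34][0-9]|5[0-5])$"
    print(as_ranges(matching_last_octets(cyveillance, "63.148.99")))
    # → [(224, 255)]
```

If the printed ranges are not the ones you meant to block, fix the pattern before it goes anywhere near a live htaccess file.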
Virtually all the Offline Browsers have the ability to cloak. However, most of them use their default UA unless it is changed by the user. It's safe to assume most users don't bother. If there is someone who takes the time, there is nothing you can really do about it.
Do you have access to your "raw" logs? If you do, I suspect you will see many of the Offline Browsers listed there. Some stat programs, such as OpenWebScope, usually only list things like IE, Netscape, etc. for some reason. Also, many webhosts that provide stats only list the Top 20 or some other finite # in the stat logs. Obviously the top of the list is going to be dominated by the popular browsers, while things that hit you less often are going to get left off.
Anyway, blocking by IP is easy:
<Limit GET>
order allow,deny
deny from 12.101.35.172
deny from 12.108.37.2
allow from all
</Limit>
In that example, anybody coming from IP 12.101.35.172 or 12.108.37.2 will get a 403 Forbidden page. You can also put in partial IPs to block an entire group. For example, 12.101 will block everything beginning with 12.101.
Be careful when blocking IPs, because you don't want to accidentally block something like AOL, for example.
I keep this .htaccess only in my "members" directory, since that's where 99 percent of my site content is. I tend to get abused by Access Diver users, and they virtually always use multiple proxies to try to brute-force their way into my members area ... I take the proxies from my logs, verify them using the very same program, and then add them to my .htaccess.
-Superman-