Gorufu, littleman, Air, SugarKane? You guys see any errors or better ways to do this....anybody got a bot to add....before I stick this in every site I manage.
Feel free to use this on your own site and start blocking bots too.
(the top part is left out)
<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
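# Refuse requests referred from iaea.org. RewriteRule tests the URL path, not a full URL, so a plain match-all pattern is enough here.
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org$
RewriteRule .* - [F]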
I have been thinking for some time now that I should implement a script that limits the number of page views for a unique visitor. This script would help stop all those folks who change their user agent name to grab an offline copy of my website, as well as other unidentified email bots etc. The script would still allow all the good spiders and similar folks all the access they want. Currently, my average visitor views about 5.9 pages per visit. With this knowledge I could limit a visitor to, say, 20 page views a day before I redirect them. The script would redirect to a "Become a Member" page that would require registration. All the registered folks would be allowed unlimited access.
What am I missing?
The script works perfectly as is ... I'm certainly no expert on htaccess, but I've tested mine extensively while implementing others' ideas into it. I've honestly never seen the RewriteBase / anywhere but here ... maybe it is technically correct, I don't know. It works fine without it though.
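On the RewriteBase question, a rough sketch of where it matters, as far as I understand mod_rewrite: RewriteBase only comes into play when a rule in a per-directory file substitutes a relative path. A rule whose substitution is "-" never rewrites the path, which would explain why the block list behaves the same with or without it. The file names old.html, new.html and private.html are made up for illustration.

```apache
RewriteEngine on
RewriteBase /

# Relative substitution: RewriteBase supplies the "/" prefix, so the
# visitor is redirected to /new.html.
RewriteRule ^old\.html$ new.html [R]

# "-" means "leave the URL alone", so RewriteBase plays no part here.
RewriteRule ^private\.html$ - [F]
```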
I have learned that there are multiple ways to do these things, and also that the slightest error in the file can screw up everything. For example, I once left out the space before the [OR] on one of the lines and the script did not block anything.
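My reading of why the missing space is so destructive: mod_rewrite splits the line on whitespace, so without the space there is no flags argument at all. The "[OR]" becomes part of the regex (a character class matching the letter O or R), and the condition is ANDed with the next one instead of ORed. Side by side, using two agent names from the list above:

```apache
# Broken: "[OR]" is glued to the pattern. This line now matches only
# agents starting with "EmailSiphonO" or "EmailSiphonR", and it is ANDed
# with the next condition, so in practice nothing is ever blocked.
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon[OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf
RewriteRule .* - [F]

# Working: the space makes [OR] a flag, so either agent is refused.
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf
RewriteRule .* - [F]
```

The last condition in a list is the one line that should not carry [OR], since there is nothing after it to OR with.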
Edge,
That is actually only a small portion of my htaccess ... I have another one in my images folder to prevent people hotlinking my pics, another one in my members directory for password authentication and that blocks many IP addresses hackers have used to try to bust in ... all proxy servers.
I like your idea, but I would not know how to implement it ... it's a good idea though.
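For anyone curious about the images-folder file mentioned above, here is a minimal sketch of the usual hotlink-blocking approach. It is not the poster's actual file; your-site.com is a placeholder and the list of extensions is an assumption.

```apache
# Hypothetical .htaccess for an images folder.
RewriteEngine on

# Let through requests that send no referer at all (direct visits, and
# browsers or firewalls that strip the header).
RewriteCond %{HTTP_REFERER} !^$

# Let through requests referred by the site's own pages.
RewriteCond %{HTTP_REFERER} !^http://(www\.)?your-site\.com(/|$) [NC]

# Anything else asking for an image file gets a 403.
RewriteRule \.(gif|jpe?g|png)$ - [F,NC]
```

The two conditions are ANDed on purpose: a request is refused only if it has a referer and that referer is some other site.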
To webmasterworld: Thanks. I have made good use of your glossary and the forums. What do you think of a "dictionary of spiders"? Or is that too much work for something that is not really necessary? I used superman's list myself, but I wonder if there are any on it that I shouldn't have blocked.
I like Sam Spade.org...a common tool lots of us use here. It will let you change your ua so you can test and also do head requests to see what kind of server someone is running on. A handy tool to use for diagnosing server troubles as well.
WOW. This has really bloomed. I'm glad to see so many people finding the rewrite script handy. I can't really take credit for it, as I just pieced together what littleman, Air, Gorufu and others posted for individual situations. I'm not so concerned with the email bots as I have no email addresses in my sites...except for that IndyLibrary one. That thing will shred your site in 2 seconds flat.
I'm more concerned with things like Front Page and other "theft" bots. One of the really cool aspects of that script is the blocking of that annoying iaea.org screen scraper or whatever it is. I don't think we've really figured out precisely what it is doing (it's certainly raised my awareness of atomic issues ;) ).
I'm just like the rest of you...learning regex as I go. It's a good time for some of the *nix geeks to shine. This has really brought out one of the strengths of WMW....the collective experience of webmasters pitching in to achieve a common goal.
http://www.esalesbiz.com/extra/
I'd add it to my htaccess above to block it. It usually shows up in my logs as Website eXtractor, but I see others get it as Website Quester ... simply blocking all agents beginning with Website will take care of it.
RewriteCond %{HTTP_USER_AGENT} ^Website [OR]
Brendan