Gorufu, littleman, Air, SugarKane? Do you guys see any errors or better ways to do this? Anybody got a bot to add before I stick this in every site I manage?
Feel free to use this on your own site and start blocking bots too.
(the top part is left out)
<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck.internetseer.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
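For what it's worth, a list like the one above can probably be shortened by folding the plain names into one alternation. This is only a sketch: it covers a subset of the agents listed, and whether it is measurably faster than separate conditions is an assumption, not something benchmarked here.

```apache
RewriteEngine on
RewriteBase /
# One condition for the agents that need no special pattern (subset of the list above)
RewriteCond %{HTTP_USER_AGENT} ^(EmailSiphon|EmailWolf|ExtractorPro|Crescent|CherryPicker|NICErsPRO|Teleport|Wget|LinkWalker|EmailCollector) [OR]
# Agents that need their own pattern stay as separate conditions
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit
# Any match above gets a 403 Forbidden; the last condition carries no [OR]
RewriteRule .* - [F]
```

Adding a new bot then means adding one more name inside the parentheses rather than a whole new line.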
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org$
RewriteRule .* - [F]
RewriteCond %{HTTP_REFERER} q=guestbook [NC,OR]
RewriteCond %{HTTP_REFERER} q=g%E4stebuch [NC,OR]
This works for MSN Search also.
Since your rewrite rules all end with the [L] flag, it doesn't really matter what order you put them in. You can choose to do the "good guys" rewrites first, or the "bad guys" first (depending on how many of each you get) in order to speed up (slightly) the majority of requests.
Only when the output from one rewrite needs to be processed by subsequent rules does the order matter all that much. In that case, you wouldn't be using the [L] flag on each ruleset.
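To illustrate the difference, here is a rough sketch. The old/ and new/ paths are made up for the example and are not from anyone's actual site.

```apache
RewriteEngine on
RewriteBase /

# Independent ruleset: [F] stops processing on a match, so where this
# block sits relative to other independent blocks only affects speed.
RewriteCond %{HTTP_USER_AGENT} ^Wget
RewriteRule .* - [F]

# Dependent rules (hypothetical paths): the second rule is applied to the
# output of the first, so the order matters and the first rule has no [L].
RewriteRule ^old/(.*)$ new/$1
RewriteRule ^new/(.*)\.htm$ new/$1.html [L]
```

With the two dependent rules swapped, a request for old/page.htm would be moved to new/page.htm but would never get its extension changed, since that rule would already have been passed.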
Bull,
That's an interesting exploit - I haven't seen that one yet, but I'll keep an eye out!
Jim
Seems the site flipdog.com has been snatching my content. Additionally, they've been doing a pretty crappy job of displaying it, and since many things on my site have changed since they archived it, that only makes it worse. I'd like to block any requests from flipdog.com. This includes requests for files referenced by the content they've already archived, as well as any attempts to get new content.
Can someone tell me if this would work:
RewriteCond %{HTTP_REFERER} ^http://(www\.)?flipdog.com/*$ [NC]
RewriteRule ^.* /robots.php [R,L]
I already have a RewriteRule in place to protect images and other files, and that's working fine. But I'd like to prevent them from getting anything in the future. I don't see what UA they are using, so I tend to believe that they are masking that.
Should I be looking at something else to block them as well?
The "/*$" at the end of your RewriteCond pattern isn't quite right. Just leave the "$" end anchor off to match anything that starts with the pattern. Escape the 2nd period with "\" too.
I don't see any point in redirecting to /robots.php. How about just a 403 response? Also, neither anchor is needed in the RewriteRule pattern - just ".*" will do.
RewriteCond %{HTTP_REFERER} ^http://(www\.)?flipdog\.com [NC]
RewriteRule .* - [F,L]
Returns a 403-Forbidden response and no content.
HTH,
Jim
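The "^" at the beginning of your RewriteCond pattern isn't quite right, since it only matches when the name is at the very start of the User-Agent string. You might try removing the "^" so the name matches anywhere in the string: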
RewriteCond %{HTTP_USER_AGENT} FlipDog [OR]
You can test any new RewriteCond using WannaBrowser; see Message #63 [webmasterworld.com].
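Putting the two suggestions together, a combined block might look something like the sketch below. It assumes their crawler actually sends "FlipDog" somewhere in its User-Agent, which has not been confirmed in this thread, so the referer condition is kept as a fallback.

```apache
RewriteEngine on
# Unanchored, case-insensitive match on the User-Agent (assumed name)
RewriteCond %{HTTP_USER_AGENT} FlipDog [NC,OR]
# Requests that arrive via a link or embedded file on their site
RewriteCond %{HTTP_REFERER} ^http://(www\.)?flipdog\.com [NC]
# Either condition matching returns 403 Forbidden and no content
RewriteRule .* - [F]
```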
Explorer¦5
Explorer¦6
sitecheck.internetseer.com (For more info see: http:¦x
Googlebot¦2
Netscape¦6
-
Netscape¦4
FAST WebCrawler¦3
Netscape¦3
Netscape¦2
(unknown)
ia_archiver
Explorer¦4
Openfind data gatherer, Openbot¦3
bumblebee¦1
Mercator 2.0
Libby_1.1¦x
Scooter 3.2.FNR
Explorer¦x
Robozilla¦1
Scooter 3.2.EX
Scooter¦3
PingALink Monitoring Services 1.0 (http:¦x
libwww perl¦5
Lachesis
Mozilla
TurnitinBot¦1
ah ha.com crawler (crawler@ah ha.com)
NationalDirectory WebSpider¦1
ScoutAbout
Pompos¦1
Gigabot¦1
NG¦1
Szukacz¦1
Opera¦6
TulipChain¦5
Scrubby¦2
AaronCarter¦1
curl¦7 4
oBot 4
Internet Explore 5.x
Microsoft URL Control 6.00.8862
Scooter 3.2
FreeFind.com SiteSearchEngine¦1
Java1.4.0
OneStop Webmaster; http:¦x
SlySearch¦1
appie 1.1 (www.walhello.com)
our agentlibwww perl¦5
Scooter ARS 1.1
Snoopy v0.1
Wget¦1
NetResearchServer¦2
Xenu Link Sleuth 1.2a
(Teradex Mapper; mapper@teradex.com; http:¦x
Microsoft URL Control 6.00.8169
Jonzilla¦6
Teleport Pro¦1
Java1.3.0
IE 5.5 Compatible Browser
Generic
Scooter 3.2.QA
metacarta (crawler@metacarta.com)
Scooter 3.2.SF0
psbot¦0
Python urllib¦1
ASPSeek¦1
Steeler¦1
Java1.3.1
W3C_Validator¦1
b2w¦0
LinkWalker
HitboxDoctor
rico¦0
Java1.3.1_02
Mewsoft Search Engine wwWebmasterWorldsoft.com¦4
asterias¦2
pavuk¦0
ColdFusion
Xenu_s Link Sleuth 1.1c
minibot
Rex Swain_s HTTP Viewer (http:¦x
Gulliver¦1
Unknown
lwp request¦2
moget¦2
Snoopy v0.94
Scooter 3.2.SB
Scooter 3.2.BT
COAST WebMaster (Windows NT)
DISCo Pump 3.2
Zeus 2895 Webster Pro V2.9 Win32
WebZIP¦5
dloader(NaverRobot)¦1
AbachoBOT (Mozilla compatible)
IP*Works! V5 HTTP¦x
MFC_Tear_Sample
NetMechanic
A WinHTTP Example Program¦1
HttpApp¦1
antibot V1.1.9¦x
http:¦x
PHP¦4
NetMechanic Page Primer
Robot: NutchCrawler, Owner: wdavies@acm.org
Sqworm¦2
Vagabondo¦2
rabaz (rabaz at gigabaz dot com)
EyeNetIE
Spinne¦2
OrangeBot
Linkbot 3.0
Net Probe
Listed here in order of hits.
I doubt that rewrite rules are an adequate way to keep robots away from your files. They consume a lot of CPU power on the server (especially with these large lists) and still don't do the job very well. I have downloaded entire sites myself for several reasons and know how easy it is to fake the UA, or to get around per-user traffic and maximum-connection limits by using a couple of proxies.
Why don't you use JavaScript for this? Most sites require it anyway. You can keep all the bots away from your files if you use a JavaScript function to generate the URLs instead of using static URLs for links or images.
In fact, I've been using this approach on my site for some time now and it works perfectly. Since then I haven't found a single download-bot access pattern in my logs. Apparently none of these tools is able to process JavaScript.
Well, I guess it's a tradeoff between security on the one hand and usability on the other.
If you use a JavaScript function to generate URLs on the client side, you will shut out all users with JavaScript turned off. And you would still need a way to make your site accessible to the SE spiders. Unless you use IP-based cloaking for that, a user would still be able to pose as some SE spider.
While the .htaccess approach has its shortcomings, I'm still not convinced that a client-side JavaScript solution would be any better.
I guess just like you can't prevent people from copying a book, you cannot prevent them from copying your website. The only thing you can do is make it harder for them.
Andreas