A Close to perfect .htaccess ban list

         

toolman

3:30 am on Oct 23, 2001 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Here's the latest rendition of my favorite ongoing artwork: my beloved .htaccess file. I've become quite fond of my little buddy, the .htaccess file, and I love the power it gives me to exclude vermin, pestoids and undesirable entities from my web sites.

Gorufu, littleman, Air, SugarKane? Do you guys see any errors or better ways to do this? Anybody got a bot to add before I stick this in every site I manage?

Feel free to use this on your own site and start blocking bots too.

(the top part is left out)

<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck\.internetseer\.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule ^.* - [F]
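# Forbid requests referred from this one site. A RewriteRule pattern is matched
# against the URL path only, never the host, so a plain ".*" is used here.
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org$
RewriteRule .* - [F]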
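
On the "better ways to do this" question, one possible variation is to fold several user-agent names into a single pattern with alternation, which shortens the list. This is only a sketch: the grouping below is arbitrary, it covers just a few names from the list above, and it should be tested before use.

```apache
# Sketch only: same user-agent test, several names per pattern via alternation.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} ^(EmailSiphon|EmailWolf|ExtractorPro|CherryPicker) [OR]
RewriteCond %{HTTP_USER_AGENT} ^(Teleport|Wget|LinkWalker|psbot)
# The last RewriteCond carries no [OR]; the rule then answers 403 Forbidden.
RewriteRule .* - [F]
```

The trade-off is readability: one name per line is easier to edit and comment out, while grouped patterns keep the file shorter.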

bull

7:48 pm on Oct 19, 2002 (gmt 0)

10+ Year Member



After visits from a s*x-spambot abusing Google
(acb12246.ipt.aol.com - - [11/Oct/2002:14:08:51 +0200] "GET /mypoorpage.htm HTTP/1.1" 200 9075 www.mydomain.net "http://www.google.de/search?q=Guestbook+Jewel&num=100...start=400&sa=N" "Mozilla/4.0 (compatible; MSIE 5.0; Windows NT; DigExt)" "-" )
I inserted the following 2 lines. There's IMHO no real reason to search for guestbook, except for spambots.

RewriteCond %{HTTP_REFERER} q=guestbook [NC,OR]
RewriteCond %{HTTP_REFERER} q=g%E4stebuch [NC,OR]

Works for MSN Search also.
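
Both lines above end in [OR], so they are meant to sit inside a longer list of conditions. As a rough sketch of how they might look on their own (assuming a 403 response is the desired outcome, as elsewhere in this thread), the last condition drops the [OR] and a RewriteRule follows:

```apache
# Sketch only: block requests whose referer is a search for "guestbook".
RewriteEngine on
RewriteCond %{HTTP_REFERER} q=guestbook [NC,OR]
# %E4 is the URL-encoded form of the German a-umlaut as it appears in the referer.
RewriteCond %{HTTP_REFERER} q=g%E4stebuch [NC]
RewriteRule .* - [F]
```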

jdMorgan

8:58 pm on Oct 19, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



58sniper,

Since your rewrite rules all end with the [L] flag, it doesn't really matter what order you put them in. You can choose to do the "good guys" rewrites first, or the "bad guys" first (depending on how many of each you get) in order to speed up (slightly) the majority of requests.

Only when the output from one rewrite needs to be processed by subsequent rules does the order matter all that much. In that case, you wouldn't be using the [L] flag on each ruleset.
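
A minimal sketch of what independent rulesets look like, assuming two rules that never need each other's output. The file names old-page.html and new-page.html are placeholders, not taken from anyone's actual configuration:

```apache
RewriteEngine on

# Ruleset 1 ("bad guys"): refuse a user agent from the ban list earlier in the thread.
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon
RewriteRule .* - [F,L]

# Ruleset 2 ("good guys"): hypothetical redirect of a moved page.
RewriteRule ^old-page\.html$ /new-page.html [R=301,L]
```

Each ruleset stops processing with [L] when it matches, so swapping the two blocks changes only how many patterns a typical request is tested against.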

Bull,

That's an interesting exploit - I haven't seen that one yet, but I'll keep an eye out!

Jim

58sniper

5:58 pm on Oct 24, 2002 (gmt 0)

10+ Year Member



I have an issue....

Seems the site flipdog.com has been snatching my content. Additionally, they've been doing a pretty crappy job of displaying it, and because many things on my site have changed since they archived it, that only makes it worse. I'd like to prevent any requests from flipdog.com. This includes requests for files from the content they've already archived, as well as any attempts to get new content.

Can someone tell me if this would work:

RewriteCond %{HTTP_REFERER} ^http://(www\.)?flipdog.com/*$ [NC]
RewriteRule ^.* /robots.php [R,L]

I already have a RewriteRule in place to protect images and other files, and that's working fine. But I'd like to prevent them from getting anything in the future. I don't see what UA they are using, so I tend to believe that they are masking that.

Should I be looking at something else to block them as well?

jdMorgan

6:55 pm on Oct 24, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



58sniper,

The "/*$" at the end of your RewriteCond pattern isn't quite right. Just leave the "$" end anchor off to match anything that starts with the pattern. Escape the 2nd period with "\" too.

I don't see any point in redirecting to your robots.txt. How about just a 403 response? Also, neither anchor is needed in the RewriteRule pattern - just ".*" will do.

RewriteCond %{HTTP_REFERER} ^http://(www\.)?flipdog\.com [NC]
RewriteRule .* - [F,L]

Returns a 403-Forbidden response and no content.

HTH,
Jim

58sniper

7:28 pm on Oct 24, 2002 (gmt 0)

10+ Year Member



Actually, I'm not redirecting to my robots.txt file, but to a file robots.php, which has some content on it.

I'll try your suggestions...

Thanks!

58sniper

2:01 pm on Oct 25, 2002 (gmt 0)

10+ Year Member


Okay, I've determined that the UA for FlipDog is

"Mozilla/4.7 (compatible; FlipDog; http://www.whizbang.com/crawler)"

in case anyone else wants to block this as well.

RewriteCond %{HTTP_USER_AGENT} ^FlipDog [OR]

should work......

121focus

11:52 pm on Oct 25, 2002 (gmt 0)

10+ Year Member



58sniper,

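The "^" at the beginning of your RewriteCond pattern isn't quite right. The user agent you quoted starts with "Mozilla/4.7", so a pattern anchored to "FlipDog" at the start will never match it. You might try removing the "^":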

RewriteCond %{HTTP_USER_AGENT} FlipDog [OR]

You can test any new RewriteCond using WannaBrowser; see Message #63 [webmasterworld.com].
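
As a standalone sketch (assuming a plain 403 is wanted, and that this is the only condition, so no [OR] is needed), the unanchored version could look like this:

```apache
# Sketch only. The FlipDog user agent quoted earlier begins with "Mozilla/4.7",
# so the name has to be matched anywhere in the string, not at the start.
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} FlipDog
RewriteRule .* - [F]
```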

dhdweb

9:29 pm on Oct 26, 2002 (gmt 0)

10+ Year Member



Here is the UA list from my site, who do I need to worry about here?

Explorer¦5
Explorer¦6
sitecheck.internetseer.com (For more info see: http:¦x
Googlebot¦2
Netscape¦6
-
Netscape¦4
FAST WebCrawler¦3
Netscape¦3
Netscape¦2
(unknown)
ia_archiver
Explorer¦4
Openfind data gatherer, Openbot¦3
bumblebee¦1
Mercator 2.0
Libby_1.1¦x
Scooter 3.2.FNR
Explorer¦x
Robozilla¦1
Scooter 3.2.EX
Scooter¦3
PingALink Monitoring Services 1.0 (http:¦x
libwww perl¦5
Lachesis
Mozilla
TurnitinBot¦1
ah ha.com crawler (crawler@ah ha.com)
NationalDirectory WebSpider¦1
ScoutAbout
Pompos¦1
Gigabot¦1
NG¦1
Szukacz¦1
Opera¦6
TulipChain¦5
Scrubby¦2
AaronCarter¦1
curl¦7 4
oBot 4
Internet Explore 5.x
Microsoft URL Control 6.00.8862
Scooter 3.2
FreeFind.com SiteSearchEngine¦1
Java1.4.0
OneStop Webmaster; http:¦x
SlySearch¦1
appie 1.1 (www.walhello.com)
our agentlibwww perl¦5
Scooter ARS 1.1
Snoopy v0.1
Wget¦1
NetResearchServer¦2
Xenu Link Sleuth 1.2a
(Teradex Mapper; mapper@teradex.com; http:¦x
Microsoft URL Control 6.00.8169
Jonzilla¦6
Teleport Pro¦1
Java1.3.0
IE 5.5 Compatible Browser
Generic
Scooter 3.2.QA
metacarta (crawler@metacarta.com)
Scooter 3.2.SF0
psbot¦0
Python urllib¦1
ASPSeek¦1
Steeler¦1
Java1.3.1
W3C_Validator¦1
b2w¦0
LinkWalker
HitboxDoctor
rico¦0
Java1.3.1_02
Mewsoft Search Engine wwWebmasterWorldsoft.com¦4
asterias¦2
pavuk¦0
ColdFusion
Xenu_s Link Sleuth 1.1c
minibot
Rex Swain_s HTTP Viewer (http:¦x
Gulliver¦1
Unknown
lwp request¦2
moget¦2
Snoopy v0.94
Scooter 3.2.SB
Scooter 3.2.BT
COAST WebMaster (Windows NT)
DISCo Pump 3.2
Zeus 2895 Webster Pro V2.9 Win32
WebZIP¦5
dloader(NaverRobot)¦1
AbachoBOT (Mozilla compatible)
IP*Works! V5 HTTP¦x
MFC_Tear_Sample
NetMechanic
A WinHTTP Example Program¦1
HttpApp¦1
antibot V1.1.9¦x
http:¦x
PHP¦4
NetMechanic Page Primer
Robot: NutchCrawler, Owner: wdavies@acm.org
Sqworm¦2
Vagabondo¦2
rabaz (rabaz at gigabaz dot com)
EyeNetIE
Spinne¦2
OrangeBot
Linkbot 3.0
Net Probe

Listed here in order of hits.

garwk

4:31 am on Nov 13, 2002 (gmt 0)



hello,

I doubt that rewrite rules are an adequate solution for keeping robots away from your files. They consume a lot of CPU power on the server (esp. with these large lists) and still don't do the job very well. I have downloaded entire sites myself for several reasons and know how easy it is to fake the UA, or to avoid per-user traffic and max. connection limits by using a couple of proxies.

Why don't you use JavaScript for this? Most sites require it anyway. You can keep all the bots away from your files if you use a JavaScript function to generate the URLs instead of using static URLs for links or images.

In fact I've been using this approach on my site for some time now and it works perfectly. Since then I haven't found a single download bot access pattern in my logs. Apparently none of these tools are able to process JavaScript.

andreasfriedrich

5:01 am on Nov 13, 2002 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Welcome to WebmasterWorld [webmasterworld.com], garwk.

Well, I guess it's a tradeoff between security on the one hand and usability on the other.

If you use a JavaScript function to generate URLs on the client side, you will shut out all users with JavaScript turned off. And you would still need a way to make your site accessible to the SE spiders. Unless you use IP-based cloaking for that, a user would still be able to pose as some SE spider.

While the .htaccess approach has its shortcomings, I'm still not convinced that a client-side JavaScript solution would be any better.

I guess just like you can't prevent people from copying a book, you cannot prevent them from copying your website. The only thing you can do is make it harder for them.

Andreas

This 243 message thread spans 25 pages: 243