
Apache Web Server Forum

This 122 message thread spans 5 pages; this is page 1.
A Close to perfect .htaccess ban list - Part 2

 11:46 pm on May 14, 2003 (gmt 0)

continued from [webmasterworld.com...]

UGH, bad typo in my original post. Here's the better version (I wasn't able to re-edit the older post?):

I'm trying to ban sites by domain name, since there has recently been a lot of referrer spam.

I have, for example, the rule:

RewriteCond %{HTTP_REFERER} ^http://(www\.)?.*stuff.*\.com/.*$ [NC]
RewriteRule ^.*$ - [F,L]

which should ban any sites containing the word "stuff"

and so on.

However, it is not working, so I am sure I did not set up a proper pattern-match rule. Anyone care to advise?

[edited by: jatar_k at 5:06 am (utc) on May 20, 2003]



 2:44 am on May 20, 2003 (gmt 0)

I have been following this forum from page one through 17, and am using the tips in it to block many bad user-agents in my .htaccess file. However, I just ran into a user-agent that shows the following name: "-" (a hyphen) in the raw web logs, but "<undefined>" in a hitcounter log for my trap page. I inserted <undefined> into my .htaccess filter, but a harvester slipped by it. What rule should I use in my RewriteCond rules to block the undefined ("-") agent? Here is what I am thinking of adding, but want someone to confirm its correctness or otherwise:
RewriteCond %{HTTP_USER_AGENT} ^\-$ [OR]


 4:32 am on May 20, 2003 (gmt 0)


I am using

RewriteCond %{HTTP_REFERER} ^-?$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^-?$ [NC]
RewriteRule .* - [F,L]

in combination, to block only cases where the UA AND the Referer are both empty ("-"), to avoid blocking innocent visitors (e.g. those using an old version of Norton Internet Security, which hides the UA).

Works well for me :-)


 4:56 am on May 20, 2003 (gmt 0)


Thanx for the reply. I first used Sam Spade to do a lookup on the IP in question, to be sure it resided somewhere that would not usually have business with me. It is based in Hong Kong, at an IP that is listed in several blacklists. The agent only indexed my entry page and my guestbook. I have seen a couple of similar entries in the past, with an <undefined> user-agent ID, so I will be adding your regexp for a referrer and user-agent of ^-?$ .


 5:20 am on May 20, 2003 (gmt 0)


What you have should work. Cleaning it up and adding the "support" stuff to the beginning:

Options +FollowSymLinks
RewriteEngine on
RewriteCond %{HTTP_REFERER} ^http://(www\.)?.*stuff.*\.com [NC]
RewriteRule .* - [F]

If you already have the Options and RewriteEngine on directives in your file, then how is it not working? Not blocking the visits? Server errors? etc.



 5:32 am on May 20, 2003 (gmt 0)

jdMorgan, yes, I have Options and RewriteEngine both on top of the htaccess file.

The reason I thought it is not working is that I had a hard time blocking one nasty spammer. The spammer always produces the following kind of access_log entries: - - [18/May/2003:07:51:20 -0400] "GET / HTTP/1.1" 403 210 "http://www.some-bad-word-and-more.com\r" "http://www.www.some-bad-word-and-more.com\r"

I blocked first by ip:

RewriteCond %{REMOTE_ADDR} ^216\.169\.111\.
RewriteRule ^.*$ - [F,L]

but it still let them in (?), so I tried blocking by word, as in my previous post.

Yesterday, I traced the IP address down and found its hosting company. I complained there and they said they would do something about it. I have yet to see.

Btw, would this also work to block specific urls:

RewriteCond %{HTTP_REFERER} stuff [NC]
RewriteRule .* - [F,L]


 5:51 am on May 20, 2003 (gmt 0)


Yes, that would work, but [L] is redundant when used with [F].

RewriteCond %{HTTP_REFERER} stuff [NC]
RewriteRule .* - [F]

In this case, there is not much difference, but do try where possible to use anchored patterns; they can be tested much faster than unanchored patterns.

Notice that I took the "/" off the end of your "stuff.*\.com" above. The reason for this is that there might be a port number appended to the domain, in which case the character following ".com" won't be a "/" and this may be why your rule did not stop him.
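The port-number case is easy to check outside of Apache. Here is a quick sketch using Python's re module (the spammer domain below is a made-up placeholder), comparing the slash-terminated pattern with the unterminated one:

```python
import re

# Original pattern: requires a literal "/" immediately after ".com"
old = re.compile(r"^http://(www\.)?.*stuff.*\.com/.*$", re.I)
# Corrected pattern: no trailing "/", so a ":port" may follow ".com"
new = re.compile(r"^http://(www\.)?.*stuff.*\.com", re.I)

with_port = "http://www.stuff-spam.com:8080/page.html"  # hypothetical referrer
plain = "http://www.stuff-spam.com/page.html"

print(bool(old.match(with_port)))  # False - ":8080" breaks the "\.com/" match
print(bool(new.match(with_port)))  # True
print(bool(old.match(plain)))      # True
print(bool(new.match(plain)))      # True
```

So a referrer carrying a port number slips past the slash-terminated pattern but not the corrected one.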



 6:09 am on May 20, 2003 (gmt 0)

Jim, thanks for the quick replies, very much appreciated!

I modified the .htaccess file according to your directions. I was surprised, though, that the [F,L] part should be written as [F], since I see the former in so many .htaccess file samples (such as Mark Pilgrim's one here [diveintomark.org])


[edited by: Woz at 11:36 am (utc) on May 20, 2003]
[edit reason] shortened URL [/edit]


 6:13 am on May 20, 2003 (gmt 0)

Why is the L (Last command) redundant here? Please clarify for us, as we have seen it used so many times.


 11:16 am on May 20, 2003 (gmt 0)


I generally disallow search engine queries directly to the guestbook on our server (who searches for guestbooks via search engine?). To accomplish this I use the following rule:

RewriteCond %{HTTP_REFERER} q=guestbook [NC,OR]

Maybe this helps further ;-)


 3:14 pm on May 20, 2003 (gmt 0)

> I was surprised, though, that the [F,L] part should be written as [F], since I see the former in so many .htaccess file samples (such as Mark Pilgrim's one here)

> Why is the L (Last command) redundant here? Please clarify for us, as we have seen it used so many times.

Take a look at the following documents, the descriptions of the [F] and [G] flags, and the examples provided in the Rewriting Guide:

Apache Module mod_rewrite - URL Rewriting Engine [httpd.apache.org]
Apache URL Rewriting Guide [httpd.apache.org]

Look for the word "immediately" in the descriptions of [F] and [G], and compare to its use in the description of [L].

I too have seen many more "incorrect" and/or inefficient rewrites than I have "perfect" ones. I have also used and posted [F,L]-terminated rules myself, both out of early unfamiliarity with mod_rewrite and later, out of old (bad) habit. Then there's the issue with ".*" at the beginning or end of an unanchored pattern. In both cases, these "mistakes" won't stop the rule from working, they just slow it down. I'd rather have a fast web than a slow web, so I point them out. They are not even real mistakes; rather, they're more like "bad style" (no offense intended here, just trying to make a point).

One secret to success with mod_rewrite is simply this: Print out the cited documents, and put them somewhere where you are likely to read them (three guesses where I keep a copy!). Then, whenever you're.... erm, sitting there, pick 'em up and read 'em. Do this until you can find the page you need by feel, or until you have to print out a second copy because the first one falls apart from wear. :)

I also like the concise regular-expressions tutorial cited in DaveAtIFG's bookmark-worthy Introduction to mod_rewrite [webmasterworld.com] post.



 10:35 pm on May 20, 2003 (gmt 0)

Very helpful comment, Jim. I printed the referenced material out and am off to a very relaxing location to read it...


 9:57 pm on May 21, 2003 (gmt 0)

I too have seen many more "incorrect" and/or inefficient rewrites than I have "perfect" ones. I have also used and posted [F,L]-terminated rules myself, both out of early unfamiliarity with mod_rewrite and later, out of old (bad) habit. Then there's the issue with ".*" at the beginning or end of an unanchored pattern.

If I understand you correctly, then the final rewrite expression should read thusly: ^.*$ - [F]
Is this correct?


 10:48 pm on May 21, 2003 (gmt 0)


No, just
RewriteRule .* - [F]
will do - there is no need to start- or end-anchor a pattern which is completely wild-carded.

What I was talking about above is this: a fully-anchored, fully-wild-carded pattern like
^.*$ - [F]
can just as easily be written
.* - [F]
and a pattern such as
^.*stuff.*$
can be shortened to
stuff

There is no need to anchor a pattern if the characters adjacent to that anchor are wild-cards. Fewer unneeded characters means smaller files and faster regex processing.

Ref: A concise Regular Expressions Tutorial [etext.lib.virginia.edu]
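Since RewriteCond and RewriteRule patterns are matched unanchored (they behave like Python's re.search, finding the pattern anywhere in the string), the equivalence is easy to demonstrate with a small Python sketch (sample strings invented):

```python
import re

samples = ["http://www.stuff-widgets.com/", "http://example.com/", "stuff", ""]

# An anchored pattern padded with wild-cards...
anchored = [bool(re.search(r"^.*stuff.*$", s)) for s in samples]
# ...accepts exactly the same strings as the bare unanchored pattern.
unanchored = [bool(re.search(r"stuff", s)) for s in samples]

print(anchored == unanchored)  # True
```

The shorter pattern matches the same set of inputs while giving the regex engine less work to do.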



 11:28 pm on May 21, 2003 (gmt 0)


Thanks for that explanation. It is close to what I had, just minus the ",L" in the brackets. I misunderstood your reference to unanchored wildcards.

I just received my copy of Mastering Regular Expressions 2nd Edition, and Writing Apache Modules with Perl.

I have a better understanding of the use of mod_rewrite now. Is this the best sub-forum to make inquiries about other .htaccess directives, or should I post them elsewhere on the boards?


 12:36 am on May 22, 2003 (gmt 0)


I think most .htaccess discussions take place here in Website Technology Issues or over in the Perl and PHP CGI Scripting forum (e.g., for SE-friendly-URL rewrites), and there's also some action in Search Engine Spider Identification (bad-bot & spider-blocking/redirecting), and Tracking and Logging (general-UA or IP blocking/redirecting). It's application- and poster-focus-dependent.

It seems like basic URL redirection for renamed files - discussions of 301 redirects, for example - is all over the place, including the (usually-inappropriate) Google forum, just depending on who's panicking and why. ;)

Heck, I can't figure out which forum is "just right" for a subject half the time, and I've actually read quite a few of the forum charters! If something is WAAAY out of line the mods will move it, though I try not to make work for them (Thanks, mods!). Starting with a site-search, you can usually figure out which forum contains the most on-topic discussion of a particular subject area or application.



 8:10 pm on Jun 9, 2003 (gmt 0)

I'm just trying to make sure I've got this working right. So here it goes:

There are only two bots (at this point) I'm trying to block. I also have a lot of 301 redirects. So, this is what I have that doesn't generate ANY errors and _seems_ to work.

Options +FollowSymLinks
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} "Indy Library"
RewriteCond %{HTTP_USER_AGENT} "IUPUI Research Bot"
RewriteRule .* - [F,L]
RewriteRule ^links/partners\.html$ [widgets.com...] [R=301]

More rules and stuff that work as they should follow, of course. My concern is the first five lines. First, it appears VALID, but will it _WORK_? Second, is this the most efficient way of doing this? Last, do I need to add anything?

Thanks in advance for any and all help.


 10:48 pm on Jun 9, 2003 (gmt 0)

Options +FollowSymLinks
RewriteEngine on
RewriteCond %{HTTP_USER_AGENT} "Indy Library"
RewriteCond %{HTTP_USER_AGENT} "IUPUI Research Bot"
RewriteRule .* - [F,L]
RewriteRule ^links/partners\.html$ [widgets.com...] [R=301]


First of all, you have two RewriteConds that should be ORed; as written, BOTH must match at once, so you need to add [OR] after the Indy Library line. Second, spaces in names must be escaped with a backslash, thusly:
Indy\ Library [OR]
^IUPUI\ Research\ Bot
Third, drop the L in the rewrite rule: .* - [F]
Fourth, in the last rule you don't need the $ delimiter. It can be retyped as:
RewriteRule ^links/partners\.html [widgets.com...] [R=301,L]

It may not require the leading ^ either, but I'm not certain. Try it without the ^ and see if it works, using [wannabrowser.com]. You are really redirecting here, so a "Redirect" directive might prove more correct, but I'm not advanced enough to say for sure.

I hope this is helpful
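The [OR] point is worth dwelling on: stacked RewriteConds are ANDed by default, so without [OR] a visitor would have to match both UA patterns at once, which these two never can. A rough Python model of the two behaviors (the UA string is an invented example; in mod_rewrite the spaces in the patterns would need backslash-escaping):

```python
import re

patterns = [r"Indy Library", r"IUPUI Research Bot"]

def blocked(user_agent, use_or):
    # Each RewriteCond is an unanchored search, like mod_rewrite's
    hits = [bool(re.search(p, user_agent)) for p in patterns]
    # [OR] means any condition may match; the default is that all must match
    return any(hits) if use_or else all(hits)

ua = "Mozilla/3.0 (compatible; Indy Library)"  # hypothetical UA
print(blocked(ua, use_or=False))  # False - ANDed conditions never both match
print(blocked(ua, use_or=True))   # True  - [OR] blocks either agent
```

Without [OR], the block silently never fires, which matches the "seems to work but doesn't" symptom described above.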


 12:28 am on Jun 10, 2003 (gmt 0)

Oaf357 & Wiz,

I'd leave that "$" in there, and do not remove the "^" ... Fully-anchored string compares are faster.
Otherwise, looks good!



 11:54 pm on Jun 12, 2003 (gmt 0)

newbie here, flailing about ...

I've been reading these threads, the SESpider Ident and the Perl forums for about a month, on and off, whenever I can find the time and it seems the more I learn the more confuseder I get :)

I want to set up a spider trap but thought it better to get .htaccess working right first since the trap will be another new learning phase.

This is a small sample of my .htaccess. Most of what I have does the trick, but not for the six below. I searched for "Indy" and "Microsoft URL Control" and have tried every variation I found in the 'close to perfect' threads, but probably not sequentially, which may be the reason they don't work. Would these be correct without the preceding "^"? And should Indy Library be changed to "Indy Library", inside quotes and without escaping the space?

RewriteEngine On
RewriteCond %{REMOTE_ADDR} "^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$" [OR]
#lots of UA's......
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [OR]
RewriteCond %{HTTP_USER_AGENT} ^webcollage [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus [OR]
RewriteRule !^403\.htm$ - [F]
Deny from
RedirectPermanent /flename.htm [mydomain.com...]

One more question, is the 'Deny from' the correct syntax to deny an IP? My hosting service said I didn't need anything above or below it.



 12:19 am on Jun 13, 2003 (gmt 0)


The problem is that you have an [OR] flag on your last RewriteCond. Since [OR]s only apply to RewriteConds, this is blowing up your RewriteRule.

Your syntax is otherwise fine as-is. Do not confuse prefix-matching syntax, as used in Redirect and Deny directives, with the extended regular expressions pattern-matching used by mod_rewrite. Beware of the many examples posted here with incorrect start and/or end anchoring. The "^" and "$" symbols have very specific functions, and adding or removing either of them will change the pattern-matching drastically.

The regular expressions tutorial linked from this Introduction to mod_rewrite [webmasterworld.com] post is quite helpful.

Sorry for any typos and terseness - typing in a hurry.


<edit>Oh, and make sure you have a space preceding the exclamation point in any RewriteRules or RewriteConds.</edit>


 4:58 am on Jun 13, 2003 (gmt 0)

Thanks for the quick response, Jim.

you have an [OR] flag on your last RewriteCond.

a space preceding the exclamation point

Both of those were sloppy copy/pasting - the last line actually was a different UA and didn't have the [OR] flag. I've no clue as to where the space before the "!" went as it's in the .htaccess

The RewriteRule is working for most of the conditions, because I see them getting the 403; it's just not working for those I mentioned.

two hours later .....

Whoo hoo - I just noticed that almost every line in my .htaccess ends with a space after the [OR] flag. I cleaned those up, maybe that's the problem. I'm really sick of looking at this file and now wonder how any of the conditions worked.


 5:29 am on Jun 13, 2003 (gmt 0)


> I've no clue as to where the space before the "!" went as it's in the .htaccess

This forum eats those spaces, but it's never clear when viewing copied code whether the poster had them in there and the forum ate them, or they were missing from the original code. To get the exclamation points to stay spaced when posting on WebmasterWorld, you have to use two spaces.

> The RewriteRule is working for most of the conditions because I see them getting the 403, it's just not working for those I mentioned.

For those UAs I block and know about, your code is correct - including pattern anchoring. The pattern in the RewriteCond must be a letter-perfect match for the UA you see in your raw log files in order to work. Exceptions are when using the [NC] flag to make the compare case-insensitive, and of course, the use of regex wild-card characters or strings.
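For example, the [NC] difference is easy to see in Python, where re.IGNORECASE plays the same role (the UA string is an invented example):

```python
import re

ua = "Mozilla/3.0 (compatible; Indy Library)"  # hypothetical raw-log UA

print(bool(re.search(r"indy library", ua)))        # False - case must match exactly
print(bool(re.search(r"indy library", ua, re.I)))  # True  - like adding [NC]
```

Without [NC] (or the flag), a lowercase pattern simply never matches the mixed-case UA from the logs.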

One other thing that messes things up is if an [OR] flag is missing on a RewriteCond line preceding one that appears to be broken.

I hope it was the spaces, 'cause otherwise, I'm stumped.

If you're sick of your .htaccess, you wouldn't want to see mine! - I do up to a dozen UA's per RewriteCond. ;)



 5:44 am on Jun 13, 2003 (gmt 0)

Since you know my code is correct, it must be the spaces. I'll post back here if it was and hopefully keep some other htaccess newbie from making the same mistake.

you wouldn't want to see mine! - I do up to dozen UA's per

eeeek, I'd be blind. Now I fully understand your water closet trick ;)

thanks for the tip about the "!" and spaces

yeah, I discovered what happens if you mess up an [OR] flag - every page went 500 when I dropped the final "]" on one of the conditions.


 3:25 am on Jun 17, 2003 (gmt 0)

Added a new one:

RewriteCond %{HTTP_USER_AGENT} ^NASA\ Search\ 1\.0$ [NC,OR]

Went straight for the guestbook. It will suck down 403s now.

Does anyone have a "complete", up to date ban list that they could either post, sticky, or link to? I'd like to know what I'm up against. Everyday I add more bots to my list.


 4:28 am on Jun 17, 2003 (gmt 0)
my bot list is rather large... i don't know how accurate it is, though... i can say that i don't have the problems, today, that i had a while back...

as for posting it, i'm not sure of the best way to make it available... i could link it from my site or i could just post it in a message... i'm sure there are plenty of corrections or optimizations that could be made to it, though... hummm...

ok, take it with the understanding that you have to determine what bots you want to allow access to your site... some of these i have blocked, you may want to allow on... others, you may want to block... i can't say that these are all-inclusive or that i haven't messed something up somewhere along the lines... also note that some of this and the associated comments are by others that have posted here and on other forums... i am thankful for their contributions but, sadly, i don't have any notes as to who they were ;-(

===== snip =====

Options +FollowSymLinks
RewriteEngine on
RewriteBase /

# this ruleset is to "stop" stupid attempts to use MS IIS exploits on us
RewriteCond %{REQUEST_URI} /(cmd|root|shell)\.exe$ [NC,OR]
RewriteCond %{REQUEST_URI} /(admin|httpodbc)\.dll$ [NC]
RewriteRule .* /cgi-bin/nonimda.cmd [L,E=HTTP_USER_AGENT:NIMDA_EXPLOIT,T=application/x-httpd-cgi]

RewriteCond %{REQUEST_URI} /default\.(ida|idq)$ [NC,OR]
RewriteCond %{REQUEST_URI} /.*\.printer$ [NC]
RewriteRule .* /cgi-bin/nocode-r.cmd [L,E=HTTP_USER_AGENT:CODERED_EXPLOIT,T=application/x-httpd-cgi]

# this ruleset is for formmail script abusers...
RewriteCond %{REQUEST_URI} formmail\.(pl|cgi)$ [NC,OR]
RewriteCond %{REQUEST_URI} mailto\.(exe|cgi)$ [NC]
RewriteRule .* /cgi-bin/nofrmml.cmd [L,E=HTTP_USER_AGENT:FORMMAIL_EXPLOIT,T=application/x-httpd-cgi]

# Cyveillance is a spybot that scours the web for copyright violations and "damaging information" on
# behalf of clients such as the RIAA and MPAA. Their robot spoofs its User-Agent to look like Internet
# Explorer, and it completely ignores robots.txt. I have
# banned it by IP address.
RewriteCond %{REMOTE_ADDR} "^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$"
RewriteRule .* - [F]

# There is another email harvester which always claims to be referred from http://www.iaea.org/.
# You may have seen this in your own referrer pages.
# I have banned it by referrer.
RewriteCond %{HTTP_REFERER} iaea\.org [NC]
RewriteRule .* - [F]

# NameProtect peddles their "online brand monitoring" to unsuspecting and gullible companies
# looking for people to sue. Despite the claims on their robot information page, they do not
# respect robots.txt; in fact, they spoof their User-Agent in multiple ways to avoid detection.
# I have banned them by User-Agent and IP address.
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{HTTP_USER_AGENT} NPBot [NC]
RewriteRule .* - [F]

# this ruleset is for unwanted useragents... possibly email harvesters
RewriteCond %{HTTP_USER_AGENT} ^[A-Z]+$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.Browse\s [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.Eval [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.Surf [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Harvest [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*HTTrack [NC,OR]
# RewriteCond %{HTTP_USER_AGENT} ^.*libwww-perl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*LWP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*prospector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} AsiaNetBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} attache [NC,OR]
RewriteCond %{HTTP_USER_AGENT} autohttp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bew [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bot\ mailto:craftbot@yahoo.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bullseye [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CherryPicker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ChinaClaw [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Crescent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} curl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} devsoft's\ http\ component [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Deweb [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Digimarc [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Digger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} digout4uagent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DIIbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} dloader(NaverRobot) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Download\ Demon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} eCatch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ecollector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Educate\ Search [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EmailCollector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EmailWolf [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EO\ Browse [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Express\ WebPictures [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ExtractorPro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EyeNetIE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} fastlwspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FEZhead [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Fetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlashGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Franklin\ Locator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Full\ Web\ Bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Getleft [NC,OR]
RewriteCond %{HTTP_USER_AGENT} GetRight [NC,OR]
RewriteCond %{HTTP_USER_AGENT} GetWebPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Go!Zilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Gozilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} go-ahead-got-it [NC,OR]
RewriteCond %{HTTP_USER_AGENT} GrabNet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Grafula [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HMView [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTML\ Works [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
# RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} IBM_Planetwide [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Image\ Stripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Image\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} IncyWincy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Industry\ Program [NC,OR]
RewriteCond %{HTTP_USER_AGENT} InterGET [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Internet\ Explore\ 5\.x [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Internet\ Ninja [NC,OR]
RewriteCond %{HTTP_USER_AGENT} InternetSeer.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Irvine [NC,OR]
RewriteCond %{HTTP_USER_AGENT} JetCar [NC,OR]
RewriteCond %{HTTP_USER_AGENT} JOC\ Web\ Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} KWebGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} leech [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mass\ Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MCspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Microsoft\ URL [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MIDown\ tool [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mirror [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Missauga\ Locator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Missigua\ Locator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mister\ PiX [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Monster [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mozilla.*NEWT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mozilla\/3\.0\.\+Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mozilla\/3.Mozilla\/2\.01 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mozilla\/4\.0$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mozzilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MSIECrawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Navroad [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NearSite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NetAnts [NC,OR]
RewriteCond %{HTTP_USER_AGENT} netattache [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NetCarta [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NetSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Net\ Vampire [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Octopus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Offline\ Explorer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Offline\ Navigator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} OpaL [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Openfind [NC,OR]
RewriteCond %{HTTP_USER_AGENT} OpenTextSiteCrawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PackRat [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PageGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Papa\ Foto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} pavuk [NC,OR]
RewriteCond %{HTTP_USER_AGENT} pcBrowser [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Plucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Production\ Bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Program\ Shareware [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PushSite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} RealDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ReGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} RepoMonkey [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Rover [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Rsync [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Siphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ScoutAbout [NC,OR]
RewriteCond %{HTTP_USER_AGENT} searchterms\.it [NC,OR]
RewriteCond %{HTTP_USER_AGENT} semanticdiscovery [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Shai [NC,OR]
RewriteCond %{HTTP_USER_AGENT} sitecheck [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SiteSnagger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SmartDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Spegla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SpiderBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SuperBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Surfbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SurfWalker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} tAkeOut [NC,OR]
RewriteCond %{HTTP_USER_AGENT} tarspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Teleport\ Pro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Telesoft [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Templeton [NC,OR]
RewriteCond %{HTTP_USER_AGENT} UtilMind [NC,OR]
RewriteCond %{HTTP_USER_AGENT} VoidEYE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} w3mir [NC,OR]
RewriteCond %{HTTP_USER_AGENT} web.by.mail [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebBandit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCopy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebEMailExtrac [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Image\ Collector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebAuto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebMiner [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebReaper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebSauger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Website\ eXtractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Website\ Quester [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebSnake [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebStripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} webvac [NC,OR]
RewriteCond %{HTTP_USER_AGENT} webwalk [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebWhacker [NC,OR]
# RewriteCond %{HTTP_USER_AGENT} wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WhosTalking [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Widow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} www\.pl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Xaldon\ WebSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Yandex [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Zeus.*Webster [NC]
# RewriteCond %{HTTP_USER_AGENT} test [NC]
RewriteCond %{REQUEST_URI} !^/badUA\.html [NC]
RewriteRule .* /badUA.html [L,E=HTTP_USER_AGENT:BAD_USER_AGENT]

# this ruleset is to stop blank user agents with blank referrers
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* /cgi-bin/noagent.cmd [L,T=application/x-httpd-cgi]

===== snip =====

there're quite a few in there... watch out for hosing your server... i got mine caught in endless loops several times while adjusting this from site wide (internal to httpd.conf) to per directory (.htaccess)... was glad i run my own server :wink:

a final note... watch for missing spaces... there should be a space before every [ and the ¦ must be replaced by the solid vertical pipe (|) on your keyboard... this site strips out extra spaces and tabs and replaces the solid vertical pipe with a split one... you'll have to watch these things...

FWIW: the above is taken directly, with no modification, from one of my main site .htaccess files... this site is live and online at this time with the above...
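Those last-octet IP-range regexes (like the Cyveillance pattern above) are easy to fumble, so it's worth brute-forcing them once outside Apache. A quick Python check confirms that pattern covers exactly 63.148.99.224 through 63.148.99.255 and nothing else:

```python
import re

# Same pattern as the Cyveillance RewriteCond, with the pipe restored
pat = re.compile(r"^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$")

# Try every possible last octet and collect the ones that match
matched = [n for n in range(256) if pat.match("63.148.99.%d" % n)]
print(matched == list(range(224, 256)))  # True - the 224-255 block, nothing else
```

Running a loop like this over 0-255 is a cheap way to verify any hand-built octet-range alternation before it goes live.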



 6:34 pm on Jun 20, 2003 (gmt 0)

Tamsy wrote:

RewriteCond %{HTTP_REFERER} ^-?$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^-?$ [NC]
RewriteRule .* - [F,L]

My question is, can ^$ safely replace ^-?$? I ask because I used cPanel to write part of my .htaccess file. To prevent hotlinking, it denies gifs, pngs, etc. when the referrer is !^http://myserver, !^http://www.myserver, and !^$.

Isn't !^$ the same as "-"? Or am I wrong?


 9:27 pm on Jun 20, 2003 (gmt 0)


^$ means "empty"

^-?$ means "may contain only a single '-' character, but the '-' character is not required." Or, in other words, "either blank or contains a single '-' character."

In the code posted above, we are looking for someone wishing to bypass a block on empty user-agent strings by using a "-" character as their user-agent. In common log format, the log entries for a blank user-agent and for a user-agent of "-" would appear identical.

So the code above blocks either blank user-agents, or "fake" blank user-agents.
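A quick Python check of the two expressions makes the difference concrete:

```python
import re

candidates = ["", "-", "--", "Mozilla/4.0"]

# ^$ accepts only the truly empty string
print([bool(re.match(r"^$", c)) for c in candidates])    # [True, False, False, False]
# ^-?$ also accepts a single "-", the "fake blank" UA
print([bool(re.match(r"^-?$", c)) for c in candidates])  # [True, True, False, False]
```

So replacing ^-?$ with ^$ would let the single-hyphen spoofers back in.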

Ref: [etext.lib.virginia.edu...]



 9:43 pm on Jul 4, 2003 (gmt 0)

Have any of you seen WebCapture 2.0? It was on my site today and is potentially a capture bot.

I run an .htaccess file I developed from this thread. Thank you for such great sharing! It has saved me an amazing amount of bandwidth and headaches. :)


 11:35 pm on Jul 4, 2003 (gmt 0)


Welcome to WebmasterWorld [webmasterworld.com]!

We've had a recent spotting of Webcature 3.0 - which may or may not be the same thing - over in the Search Engine Spider Identification forum [webmasterworld.com].

The fact that 3.0 doesn't fetch robots.txt is a bad sign...


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved