I'm trying to ban sites by domain name, since there have been a lot of referrer spammers lately.
I have, for example, the rule:
RewriteCond %{HTTP_REFERER} ^http://(www\.)?.*stuff.*\.com/.*$ [NC]
RewriteRule ^.*$ - [F,L]
which should ban any site whose domain contains the word "stuff":
www.stuff.com
www.whatkindofstuff.com
www.some-other-stuff.com
and so on.
However, it is not working, so I am sure I did not set up a proper pattern-matching rule. Anyone care to advise?
[edited by: jatar_k at 5:06 am (utc) on May 20, 2003]
The problem is that you have an [OR] flag on your last RewriteCond. Since [OR]s only apply to RewriteConds, this is blowing up your RewriteRule.
Your syntax is otherwise fine as-is. Do not confuse prefix-matching syntax, as used in Redirect and Deny directives, with the extended regular expressions pattern-matching used by mod_rewrite. Beware of the many examples posted here with incorrect start and/or end anchoring. The "^" and "$" symbols have very specific functions, and adding or removing either of them will change the pattern-matching drastically.
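To make the point about anchoring concrete, here is a minimal Python sketch. It uses Python's re module as a stand-in for mod_rewrite's regex engine; for patterns this simple the two should agree, but that is an assumption. The referrer strings are made up, modelled on the domains named in the question.

```python
import re

# Hypothetical Referer values, modelled on the domains named in the question.
referers = [
    "http://www.stuff.com/",
    "http://www.whatkindofstuff.com/page.html",
    "http://some-other-stuff.com/",
    "http://www.stuff.com",           # no trailing slash after ".com"
    "http://example.org/stuff.html",  # "stuff" in the path, host is not a .com
]

# The pattern from the question: anchored at both ends, and it requires
# a "/" immediately after ".com".
anchored = re.compile(r"^http://(www\.)?.*stuff.*\.com/.*$", re.IGNORECASE)

# An unanchored alternative: "stuff" followed somewhere later by ".com".
unanchored = re.compile(r"stuff.*\.com", re.IGNORECASE)


def blocked(pattern, referer):
    """Return True if the pattern matches, i.e. the request would be refused."""
    return pattern.search(referer) is not None


for r in referers:
    print(f"{r:45} anchored={blocked(anchored, r)} unanchored={blocked(unanchored, r)}")
```

The only input on which the two patterns disagree is the referrer without a trailing slash: the anchored version demands a "/" after ".com", so it lets that one through.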
The regular expressions tutorial linked from this Introduction to mod_rewrite [webmasterworld.com] post is quite helpful.
Sorry for any typos and terseness - typing in a hurry.
Jim
<edit>Oh, and make sure you have a space preceding the exclamation point in any RewriteRules or RewriteConds.</edit>
> you have an [OR] flag on your last RewriteCond.
> a space preceding the exclamation point
Both of those were sloppy copy/pasting - the last line actually was a different UA and didn't have the [OR] flag. I've no clue as to where the space before the "!" went, as it's in the .htaccess.
The RewriteRule is working for most of the conditions because I see them getting the 403; it's just not working for those I mentioned.
two hours later .....
Whoo hoo - I just noticed that almost every line in my .htaccess ends with a space after the [OR] flag. I cleaned those up, maybe that's the problem. I'm really sick of looking at this file and now wonder how any of the conditions worked.
> I've no clue as to where the space before the "!" went as it's in the .htaccess
This forum eats those spaces, but it's never clear when viewing copied code whether the poster had them in there and the forum ate them, or they were missing from the original code. To get the exclamation points to stay spaced when posting on WebmasterWorld, you have to use two spaces.
> The RewriteRule is working for most of the conditions because I see them getting the 403, it's just not working for those I mentioned.
For those UAs I block and know about, your code is correct - including pattern anchoring. The pattern in the RewriteCond must be a letter-perfect match for the UA you see in your raw log files in order to work. Exceptions are when using the [NC] flag to make the compare case-insensitive, and of course, the use of regex wild-card characters or strings.
One other thing that messes things up is if an [OR] flag is missing on a RewriteCond line preceding one that appears to be broken.
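The effect of a misplaced [OR] is easier to see in a toy model. The Python sketch below walks a list of conditions the way a RewriteCond chain is generally described: conditions are ANDed unless a line carries [OR], which ties it to the line that follows. The function name rule_fires and the three-entry lists are hypothetical, and this is a simplified model of the behaviour, not mod_rewrite's actual code.

```python
import re


def rule_fires(conditions, user_agent):
    """Decide whether the RewriteRule after a chain of conditions would apply.

    conditions is a list of (pattern, or_next) pairs; or_next=True stands
    for the [OR] flag on that line. Simplified model, assumptions as above.
    """
    i = 0
    while i < len(conditions):
        pattern, or_next = conditions[i]
        matched = re.search(pattern, user_agent, re.IGNORECASE) is not None
        if or_next:
            if matched:
                # One member of the OR group matched: skip the rest of the group.
                while i < len(conditions) and conditions[i][1]:
                    i += 1
            # If it did not match, the next condition still gets its chance.
        elif not matched:
            return False
        i += 1
    return True


# Correct chain: [OR] on every line except the last.
good = [("EmailSiphon", True), ("WebZIP", True), ("larbin", False)]
# Stray [OR] on the last line.
trailing_or = [("EmailSiphon", True), ("WebZIP", True), ("larbin", True)]
# [OR] missing on the first line.
missing_or = [("EmailSiphon", False), ("WebZIP", True), ("larbin", False)]

browser = "Mozilla/4.0 (compatible; MSIE 6.0)"

print("good, WebZIP UA:       ", rule_fires(good, "WebZIP/4.0"))
print("good, browser:         ", rule_fires(good, browser))
print("trailing [OR], browser:", rule_fires(trailing_or, browser))
print("missing [OR], WebZIP:  ", rule_fires(missing_or, "WebZIP/4.0"))
```

In this model a trailing [OR] makes the rule fire for an ordinary browser, and a missing [OR] on an earlier line stops the lines after it from blocking anything on their own. Real servers may differ by version, so the raw logs remain the final word.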
I hope it was the spaces, 'cause otherwise, I'm stumped.
If you're sick of your .htaccess, you wouldn't want to see mine! - I do up to a dozen UAs per RewriteCond. ;)
Jim
> you wouldn't want to see mine! - I do up to a dozen UAs per
eeeek, I'd be blind. Now I fully understand your water closet trick ;)
thanks for the tip about the "!" and spaces
yeah, I discovered what happens if you mess up an [OR] flag - every page went 500 when I dropped the final "]" on one of the conditions.
RewriteCond %{HTTP_USER_AGENT} ^NASA\ Search\ 1\.0$ [NC,OR]
Went straight for the guestbook. It will suck down 403s now.
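As a quick check of what that condition accepts, here is a small Python sketch. Python's re module stands in for the server's regex engine (an assumption that should hold for a pattern this simple), and the test strings other than the exact UA are invented.

```python
import re

# Pattern from the RewriteCond above; "\ " is an escaped literal space
# and "\." is a literal dot.
PATTERN = r"^NASA\ Search\ 1\.0$"

with_nc = re.compile(PATTERN, re.IGNORECASE)  # IGNORECASE plays the role of [NC]
without_nc = re.compile(PATTERN)              # what the line would do without [NC]

samples = [
    "NASA Search 1.0",         # the exact UA
    "nasa search 1.0",         # same UA, different case
    "NASA Search 1.0 (beta)",  # extra text after the version
    "NASA Search 1x0",         # "\." must be a real dot
]

for ua in samples:
    print(f"{ua!r:28} [NC]={with_nc.search(ua) is not None} "
          f"no-flag={without_nc.search(ua) is not None}")
```

Because the pattern is anchored at both ends, anything added before or after the UA string escapes the block, which is the "letter-perfect match" requirement in practice.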
Does anyone have a "complete", up to date ban list that they could either post, sticky, or link to? I'd like to know what I'm up against. Every day I add more bots to my list.
as for posting it, i'm not sure of the best way to make it available... i could link it from my site or i could just post it in a message... i'm sure there are plenty of corrections or optimizations that could be made to it, though... hummm...
ok, take it with the understanding that you have to determine what bots you want to allow access to your site... some of these i have blocked, you may want to allow on... others, you may want to block... i can't say that these are all-inclusive or that i haven't messed something up somewhere along the lines... also note that some of this and the associated comments are by others that have posted here and on other forums... i am thankful for their contributions but, sadly, i don't have any notes as to who they were ;-(
===== snip =====
Options +FollowSymLinks
RewriteEngine on
RewriteBase /
# this ruleset is to "stop" stupid attempts to use MS IIS exploits on us
# NIMDA
RewriteCond %{REQUEST_URI} /(cmd|root|shell)\.exe$ [NC,OR]
RewriteCond %{REQUEST_URI} /(admin|httpodbc)\.dll$ [NC]
RewriteRule .* /cgi-bin/nonimda.cmd [L,E=HTTP_USER_AGENT:NIMDA_EXPLOIT,T=application/x-httpd-cgi]
# CODERED
RewriteCond %{REQUEST_URI} /default\.(ida|idq)$ [NC,OR]
RewriteCond %{REQUEST_URI} /.*\.printer$ [NC]
RewriteRule .* /cgi-bin/nocode-r.cmd [L,E=HTTP_USER_AGENT:CODERED_EXPLOIT,T=application/x-httpd-cgi]
# this ruleset is for formmail script abusers...
RewriteCond %{REQUEST_URI} formmail\.(pl|cgi)$ [NC,OR]
RewriteCond %{REQUEST_URI} mailto\.(exe|cgi)$ [NC]
RewriteRule .* /cgi-bin/nofrmml.cmd [L,E=HTTP_USER_AGENT:FORMMAIL_EXPLOIT,T=application/x-httpd-cgi]
# Cyveillance is a spybot that scours the web for copyright violations and “damaging information” on
# behalf of clients such as the RIAA and MPAA. Their robot spoofs its User-Agent to look like Internet
# Explorer, and it completely ignores robots.txt. I have
# banned it by IP address.
RewriteCond %{REMOTE_ADDR} "^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$"
RewriteRule .* - [F]
# There is another email harvester which always claims to be referred from http://www.iaea.org/.
# You may have seen this in your own referrer pages.
# I have banned it by referrer.
RewriteCond %{HTTP_REFERER} iaea\.org [NC]
RewriteRule .* - [F]
# NameProtect peddles their “online brand monitoring” to unsuspecting and gullible companies
# looking for people to sue. Despite the claims on their robot information page, they do not
# respect robots.txt; in fact, they spoof their User-Agent in multiple ways to avoid detection.
# I have banned them by User-Agent and IP address.
RewriteCond %{REMOTE_ADDR} ^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{REMOTE_ADDR} ^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$ [OR]
RewriteCond %{HTTP_USER_AGENT} NPBot [NC]
RewriteRule .* - [F]
# this ruleset is for unwanted useragents... possibly email harvesters
RewriteCond %{HTTP_USER_AGENT} ^[A-Z]+$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.Browse\s [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.Eval [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.Surf [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*Harvest [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*HTTrack [NC,OR]
# RewriteCond %{HTTP_USER_AGENT} ^.*libwww-perl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*LWP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^.*prospector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} AsiaNetBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ASSORT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} attache [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ATHENS [NC,OR]
RewriteCond %{HTTP_USER_AGENT} autohttp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} bew [NC,OR]
RewriteCond %{HTTP_USER_AGENT} BlackWidow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bot\ mailto:craftbot@yahoo.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Bullseye [NC,OR]
RewriteCond %{HTTP_USER_AGENT} CherryPicker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ChinaClaw [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Crescent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} curl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} devsoft's\ http\ component [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Deweb [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Digimarc [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Digger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} digout4uagent [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DIIbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DISCo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} dloader\(NaverRobot\) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Download\ Demon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} eCatch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ecollector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Educate\ Search [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EirGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EmailCollector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EmailSiphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EmailWolf [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EO\ Browse [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Express\ WebPictures [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ExtractorPro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} EyeNetIE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} fastlwspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FEZhead [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Fetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} FlashGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Franklin\ Locator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Full\ Web\ Bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Getleft [NC,OR]
RewriteCond %{HTTP_USER_AGENT} GetRight [NC,OR]
RewriteCond %{HTTP_USER_AGENT} GetURL [NC,OR]
RewriteCond %{HTTP_USER_AGENT} GetWebPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Go!Zilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Gozilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} go-ahead-got-it [NC,OR]
RewriteCond %{HTTP_USER_AGENT} GrabNet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Grafula [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HMView [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTML\ Works [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
# RewriteCond %{HTTP_USER_AGENT} ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} IBM_Planetwide [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Image\ Stripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Image\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} IncyWincy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Industry\ Program [NC,OR]
RewriteCond %{HTTP_USER_AGENT} InterGET [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Internet\ Explore\ 5\.x [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Internet\ Ninja [NC,OR]
RewriteCond %{HTTP_USER_AGENT} InternetSeer.com [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Irvine [NC,OR]
RewriteCond %{HTTP_USER_AGENT} JetCar [NC,OR]
RewriteCond %{HTTP_USER_AGENT} JOC\ Web\ Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} KWebGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} leech [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mass\ Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MCspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Microsoft\ URL [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MIDown\ tool [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mirror [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Missauga\ Locator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Missigua\ Locator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mister\ PiX [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Monster [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mozilla.*NEWT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mozilla\/3\.0\.\+Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mozilla\/3.Mozilla\/2\.01 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mozilla\/4\.0$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Mozzilla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MSIECrawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Navroad [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NearSite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NetAnts [NC,OR]
RewriteCond %{HTTP_USER_AGENT} netattache [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NetCarta [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NetSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Net\ Vampire [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NetZIP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NICErsPRO [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Octopus [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Offline\ Explorer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Offline\ Navigator [NC,OR]
RewriteCond %{HTTP_USER_AGENT} OpaL [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Openfind [NC,OR]
RewriteCond %{HTTP_USER_AGENT} OpenTextSiteCrawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PackRat [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PageGrabber [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Papa\ Foto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} pavuk [NC,OR]
RewriteCond %{HTTP_USER_AGENT} pcBrowser [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Plucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Production\ Bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Program\ Shareware [NC,OR]
RewriteCond %{HTTP_USER_AGENT} PushSite [NC,OR]
RewriteCond %{HTTP_USER_AGENT} RealDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ReGet [NC,OR]
RewriteCond %{HTTP_USER_AGENT} RepoMonkey [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Rover [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Rsync [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Siphon [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ScoutAbout [NC,OR]
RewriteCond %{HTTP_USER_AGENT} searchterms\.it [NC,OR]
RewriteCond %{HTTP_USER_AGENT} semanticdiscovery [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Shai [NC,OR]
RewriteCond %{HTTP_USER_AGENT} sitecheck [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SiteSnagger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SmartDownload [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Spegla [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SpiderBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SuperBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SuperHTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Surfbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SurfWalker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} tAkeOut [NC,OR]
RewriteCond %{HTTP_USER_AGENT} tarspider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Teleport\ Pro [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Telesoft [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Templeton [NC,OR]
RewriteCond %{HTTP_USER_AGENT} UtilMind [NC,OR]
RewriteCond %{HTTP_USER_AGENT} VoidEYE [NC,OR]
RewriteCond %{HTTP_USER_AGENT} w3mir [NC,OR]
RewriteCond %{HTTP_USER_AGENT} web.by.mail [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebBandit [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCopy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebEMailExtrac [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Image\ Collector [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Sucker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebAuto [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebMiner [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebReaper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebSauger [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Website\ eXtractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Website\ Quester [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebSnake [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebStripper [NC,OR]
RewriteCond %{HTTP_USER_AGENT} webvac [NC,OR]
RewriteCond %{HTTP_USER_AGENT} webwalk [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebWhacker [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebZIP [NC,OR]
# RewriteCond %{HTTP_USER_AGENT} wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WhosTalking [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Widow [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WUMPUS [NC,OR]
RewriteCond %{HTTP_USER_AGENT} www\.pl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Xaldon\ WebSpider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} XGET [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Yandex [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Zeus.*Webster [NC]
#RewriteCond %{HTTP_USER_AGENT} test [NC]
RewriteCond %{REQUEST_URI} !^/badUA\.html [NC]
RewriteRule .* /badUA.html [L,E=HTTP_USER_AGENT:BAD_USER_AGENT]
# this ruleset is to stop blank user agents with blank referrers
RewriteCond %{HTTP_REFERER} ^-?$
RewriteCond %{HTTP_USER_AGENT} ^-?$
RewriteRule .* /cgi-bin/noagent.cmd [L,T=application/x-httpd-cgi]
===== snip =====
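The three REMOTE_ADDR patterns in the list are the hardest lines to read by eye, so here is a small Python sketch that enumerates which final octets each one accepts. The regexes are copied from the list (with the vertical pipe restored and the surrounding quotes dropped); the helper name blocked_octets is made up for this example, and Python's re module is assumed to behave like the server's engine for these patterns.

```python
import re

# Address prefix -> pattern, as posted in the ruleset above.
PATTERNS = {
    "63.148.99.": r"^63\.148\.99\.2(2[4-9]|[3-4][0-9]|5[0-5])$",
    "12.148.196.": r"^12\.148\.196\.(12[8-9]|1[3-9][0-9]|2[0-4][0-9]|25[0-5])$",
    "12.148.209.": r"^12\.148\.209\.(19[2-9]|2[0-4][0-9]|25[0-5])$",
}


def blocked_octets(prefix, pattern):
    """Return every last octet (0-255) for which prefix + octet matches."""
    rx = re.compile(pattern)
    return [n for n in range(256) if rx.search(f"{prefix}{n}")]


for prefix, pattern in PATTERNS.items():
    octets = blocked_octets(prefix, pattern)
    print(f"{prefix}x blocks {octets[0]}-{octets[-1]} ({len(octets)} addresses)")
```

Each pattern turns out to cover one contiguous block at the top of its range, which is a handy sanity check before pasting an address pattern into a live .htaccess.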
there're quite a few in there... watch out for hosing your server... i got mine caught in endless loops several times while adjusting this from site wide (internal to httpd.conf) to per directory (.htaccess)... was glad i run my own server :wink:
a final note... watch for missing spaces when copying code from this thread... there should be a space before every [ and any ¦ must be replaced by the vertical pipe (|) on your keyboard... this site strips out extra spaces and tabs and swaps the vertical pipe for a different bar character... you'll have to watch these things...
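Why the space before the [ matters can be shown with a rough Python sketch. It assumes the directive's arguments are separated by unescaped whitespace; the helper parse_rewritecond is hypothetical and far simpler than Apache's real parser, and the UA string is an invented example.

```python
import re


def parse_rewritecond(line):
    """Split a RewriteCond line into (pattern, flags).

    Splits on whitespace that is not preceded by a backslash, so "\\ "
    inside a pattern stays intact. Simplified stand-in for the real parser.
    """
    tokens = re.split(r"(?<!\\)\s+", line.strip())
    pattern = tokens[2]
    flags = tokens[3] if len(tokens) > 3 else None
    return pattern, flags


broken = r"RewriteCond %{HTTP_USER_AGENT} Teleport\ Pro[NC,OR]"
fixed = r"RewriteCond %{HTTP_USER_AGENT} Teleport\ Pro [NC,OR]"

ua = "Teleport Pro/1.29"  # hypothetical User-Agent string

for line in (broken, fixed):
    pattern, flags = parse_rewritecond(line)
    hit = re.search(pattern, ua) is not None
    print(f"pattern={pattern!r} flags={flags!r} matches={hit}")
```

Without the space, the flags fuse onto the pattern, where [NC,OR] reads as a character class (one of N, C, comma, O, R). The condition then has no flags at all and demands an extra character that this UA does not have.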
FWIW: the above is taken directly, with no modification, from one of my main site .htaccess files... this site is live and online at this time with the above...
HTH
RewriteCond %{HTTP_REFERER} ^-?$ [NC]
RewriteCond %{HTTP_USER_AGENT} ^-?$ [NC]
RewriteRule .* - [F,L]
My question is, can ^$ safely replace ^-?$? I ask because I used cPanel to write part of my .htaccess file. To prevent hotlinking, it denies GIFs, PNGs, etc. when the referrer is !^http://myserver, !^http://www.myserver and !^$.
Isn't !^$ the same as "-"? Or am I wrong?
^$ means "empty"
^-?$ means "may contain only a single '-' character, but the '-' character is not required." Or, in other words, "either blank or contains a single '-' character."
In the code posted above, we are looking for someone wishing to bypass a block for empty user-agent string by using a "-" character as their user-agent. In common log format, the log entry for a blank user-agent and a user-agent of "-" would appear identical.
So the code above blocks either blank user-agents, or "fake" blank user-agents.
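The difference between the two patterns can be checked in a few lines of Python. This is only an illustration with Python's re module standing in for the server's regex engine, and the sample header values are invented.

```python
import re

empty_only = re.compile(r"^$")      # matches only an empty string
empty_or_dash = re.compile(r"^-?$")  # matches an empty string or a lone "-"

samples = ["", "-", "--", "Mozilla/4.0"]

for value in samples:
    print(f"{value!r:15} ^$={empty_only.search(value) is not None} "
          f"^-?$={empty_or_dash.search(value) is not None}")
```

The two patterns agree on everything except a literal "-" sent as the header value, which only ^-?$ catches.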
Ref: [etext.lib.virginia.edu...]
HTH,
Jim
Welcome to WebmasterWorld [webmasterworld.com]!
We've had a recent spotting of Webcature 3.0 - which may or may not be the same thing - over in the Search Engine Spider Identification forum [webmasterworld.com].
The fact that 3.0 doesn't fetch robots.txt is a bad sign...
Jim