Home / Forums Index / Code, Content, and Presentation / Apache Web Server
Forum Library, Charter, Moderators: Ocean10000 & incrediBILL & phranque

Apache Web Server Forum

    
A Close to Perfect .htaccess ban list - Revisited
Update .htaccess User-Agent Ban Info
hybrid6studios




msg:3063813
 1:26 am on Aug 29, 2006 (gmt 0)

Hey All,

Along with you, I've done a lot of work to create a perfect .htaccess Ban List. There will always be rogues that get through or new and better bots. You can't block all of them but you CAN keep your server load down and your access streamlined to your target audience. I have 228 User Agents blocked and have minimal annoyances. Here is my latest .htaccess code:

## .htaccess Code :: BEGIN
## Block Bad Bots by User-Agent
SetEnvIfNoCase User-Agent ^$ bad_bot
SetEnvIfNoCase User-Agent "^AESOP_com_SpiderMan" bad_bot
SetEnvIfNoCase User-Agent "^Alexibot" bad_bot
SetEnvIfNoCase User-Agent "Anonymouse.org" bad_bot
SetEnvIfNoCase User-Agent "^asterias" bad_bot
SetEnvIfNoCase User-Agent "^attach" bad_bot
SetEnvIfNoCase User-Agent "^BackDoorBot" bad_bot
SetEnvIfNoCase User-Agent "^BackWeb" bad_bot
SetEnvIfNoCase User-Agent "Bandit" bad_bot
SetEnvIfNoCase User-Agent "^Baiduspider" bad_bot
SetEnvIfNoCase User-Agent "^BatchFTP" bad_bot
SetEnvIfNoCase User-Agent "^Bigfoot" bad_bot
SetEnvIfNoCase User-Agent "^Black.Hole" bad_bot
SetEnvIfNoCase User-Agent "^BlackWidow" bad_bot
SetEnvIfNoCase User-Agent "^BlowFish" bad_bot
SetEnvIfNoCase User-Agent "^Bot\ mailto:craftbot@yahoo.com" bad_bot
SetEnvIfNoCase User-Agent "^BotALot" bad_bot
SetEnvIfNoCase User-Agent "Buddy" bad_bot
SetEnvIfNoCase User-Agent "^BuiltBotTough" bad_bot
SetEnvIfNoCase User-Agent "^Bullseye" bad_bot
SetEnvIfNoCase User-Agent "^BunnySlippers" bad_bot
SetEnvIfNoCase User-Agent "^Cegbfeieh" bad_bot
SetEnvIfNoCase User-Agent "^CheeseBot" bad_bot
SetEnvIfNoCase User-Agent "^CherryPicker" bad_bot
SetEnvIfNoCase User-Agent "^ChinaClaw" bad_bot
SetEnvIfNoCase User-Agent "Collector" bad_bot
SetEnvIfNoCase User-Agent "Copier" bad_bot
SetEnvIfNoCase User-Agent "^CopyRightCheck" bad_bot
SetEnvIfNoCase User-Agent "^cosmos" bad_bot
SetEnvIfNoCase User-Agent "^Crescent" bad_bot
SetEnvIfNoCase User-Agent "^Curl" bad_bot
SetEnvIfNoCase User-Agent "^Custo" bad_bot
SetEnvIfNoCase User-Agent "^DA" bad_bot
SetEnvIfNoCase User-Agent "^DISCo" bad_bot
SetEnvIfNoCase User-Agent "^DIIbot" bad_bot
SetEnvIfNoCase User-Agent "^DittoSpyder" bad_bot
SetEnvIfNoCase User-Agent "^Download" bad_bot
SetEnvIfNoCase User-Agent "^Download\ Demon" bad_bot
SetEnvIfNoCase User-Agent "^Download\ Devil" bad_bot
SetEnvIfNoCase User-Agent "^Download\ Wonder" bad_bot
SetEnvIfNoCase User-Agent "Downloader" bad_bot
SetEnvIfNoCase User-Agent "^dragonfly" bad_bot
SetEnvIfNoCase User-Agent "^Drip" bad_bot
SetEnvIfNoCase User-Agent "^eCatch" bad_bot
SetEnvIfNoCase User-Agent "^EasyDL" bad_bot
SetEnvIfNoCase User-Agent "^ebingbong" bad_bot
SetEnvIfNoCase User-Agent "^EirGrabber" bad_bot
SetEnvIfNoCase User-Agent "^EmailCollector" bad_bot
SetEnvIfNoCase User-Agent "^EmailSiphon" bad_bot
SetEnvIfNoCase User-Agent "^EmailWolf" bad_bot
SetEnvIfNoCase User-Agent "^EroCrawler" bad_bot
SetEnvIfNoCase User-Agent "^Exabot" bad_bot
SetEnvIfNoCase User-Agent "^Express\ WebPictures" bad_bot
SetEnvIfNoCase User-Agent "Extractor" bad_bot
SetEnvIfNoCase User-Agent "^EyeNetIE" bad_bot
SetEnvIfNoCase User-Agent "^FileHound" bad_bot
SetEnvIfNoCase User-Agent "^FlashGet" bad_bot
SetEnvIfNoCase User-Agent "^Foobot" bad_bot
SetEnvIfNoCase User-Agent "^flunky" bad_bot
SetEnvIfNoCase User-Agent "^FrontPage" bad_bot
SetEnvIfNoCase User-Agent "^GetRight" bad_bot
SetEnvIfNoCase User-Agent "^GetSmart" bad_bot
SetEnvIfNoCase User-Agent "^GetWeb!" bad_bot
SetEnvIfNoCase User-Agent "^Go!Zilla" bad_bot
SetEnvIfNoCase User-Agent "Google\ Wireless\ Transcoder" bad_bot
SetEnvIfNoCase User-Agent "^Go-Ahead-Got-It" bad_bot
SetEnvIfNoCase User-Agent "^gotit" bad_bot
SetEnvIfNoCase User-Agent "Grabber" bad_bot
SetEnvIfNoCase User-Agent "^GrabNet" bad_bot
SetEnvIfNoCase User-Agent "^Grafula" bad_bot
SetEnvIfNoCase User-Agent "^Harvest" bad_bot
SetEnvIfNoCase User-Agent "^hloader" bad_bot
SetEnvIfNoCase User-Agent "^HMView" bad_bot
SetEnvIfNoCase User-Agent "^httplib" bad_bot
SetEnvIfNoCase User-Agent "^HTTrack" bad_bot
SetEnvIfNoCase User-Agent "^humanlinks" bad_bot
SetEnvIfNoCase User-Agent "^ia_archiver" bad_bot
SetEnvIfNoCase User-Agent "^IlseBot" bad_bot
SetEnvIfNoCase User-Agent "^Image\ Stripper" bad_bot
SetEnvIfNoCase User-Agent "^Image\ Sucker" bad_bot
SetEnvIfNoCase User-Agent "Indy\ Library" bad_bot
SetEnvIfNoCase User-Agent "^InfoNaviRobot" bad_bot
SetEnvIfNoCase User-Agent "^InfoTekies" bad_bot
SetEnvIfNoCase User-Agent "^Intelliseek" bad_bot
SetEnvIfNoCase User-Agent "^InterGET" bad_bot
SetEnvIfNoCase User-Agent "^Internet\ Ninja" bad_bot
SetEnvIfNoCase User-Agent "^Iria" bad_bot
SetEnvIfNoCase User-Agent "^Jakarta" bad_bot
SetEnvIfNoCase User-Agent "^JennyBot" bad_bot
SetEnvIfNoCase User-Agent "^JetCar" bad_bot
SetEnvIfNoCase User-Agent "^JOC" bad_bot
SetEnvIfNoCase User-Agent "^JustView" bad_bot
SetEnvIfNoCase User-Agent "^Jyxobot" bad_bot
SetEnvIfNoCase User-Agent "^Kenjin.Spider" bad_bot
SetEnvIfNoCase User-Agent "^Keyword.Density" bad_bot
SetEnvIfNoCase User-Agent "^larbin" bad_bot
SetEnvIfNoCase User-Agent "^LeechFTP" bad_bot
SetEnvIfNoCase User-Agent "^LexiBot" bad_bot
SetEnvIfNoCase User-Agent "^lftp" bad_bot
SetEnvIfNoCase User-Agent "^libWeb/clsHTTP" bad_bot
SetEnvIfNoCase User-Agent "^likse" bad_bot
SetEnvIfNoCase User-Agent "^LinkextractorPro" bad_bot
SetEnvIfNoCase User-Agent "^LinkScan/8.1a.Unix" bad_bot
SetEnvIfNoCase User-Agent "^LNSpiderguy" bad_bot
SetEnvIfNoCase User-Agent "^LinkWalker" bad_bot
SetEnvIfNoCase User-Agent "^lwp-trivial" bad_bot
SetEnvIfNoCase User-Agent "^LWP::Simple" bad_bot
SetEnvIfNoCase User-Agent "^Magnet" bad_bot
SetEnvIfNoCase User-Agent "^Mag-Net" bad_bot
SetEnvIfNoCase User-Agent "^MarkWatch" bad_bot
SetEnvIfNoCase User-Agent "^Mass\ Downloader" bad_bot
SetEnvIfNoCase User-Agent "^Mata.Hari" bad_bot
SetEnvIfNoCase User-Agent "^Memo" bad_bot
SetEnvIfNoCase User-Agent "^Microsoft.URL" bad_bot
SetEnvIfNoCase User-Agent "^Microsoft\ URL\ Control" bad_bot
SetEnvIfNoCase User-Agent "^MIDown\ tool" bad_bot
SetEnvIfNoCase User-Agent "^MIIxpc" bad_bot
SetEnvIfNoCase User-Agent "^Mirror" bad_bot
SetEnvIfNoCase User-Agent "^Missigua\ Locator" bad_bot
SetEnvIfNoCase User-Agent "^Mister\ PiX" bad_bot
SetEnvIfNoCase User-Agent "^moget" bad_bot
SetEnvIfNoCase User-Agent "^Mozilla/3.Mozilla/2.01" bad_bot
SetEnvIfNoCase User-Agent "^Mozilla.*NEWT" bad_bot
SetEnvIfNoCase User-Agent "^NAMEPROTECT" bad_bot
SetEnvIfNoCase User-Agent "^Navroad" bad_bot
SetEnvIfNoCase User-Agent "^NearSite" bad_bot
SetEnvIfNoCase User-Agent "^NetAnts" bad_bot
SetEnvIfNoCase User-Agent "^Netcraft" bad_bot
SetEnvIfNoCase User-Agent "^NetMechanic" bad_bot
SetEnvIfNoCase User-Agent "^NetSpider" bad_bot
SetEnvIfNoCase User-Agent "^Net\ Vampire" bad_bot
SetEnvIfNoCase User-Agent "^NetZIP" bad_bot
SetEnvIfNoCase User-Agent "^NextGenSearchBot" bad_bot
SetEnvIfNoCase User-Agent "^NG" bad_bot
SetEnvIfNoCase User-Agent "^NICErsPRO" bad_bot
SetEnvIfNoCase User-Agent "^NimbleCrawler" bad_bot
SetEnvIfNoCase User-Agent "^Ninja" bad_bot
SetEnvIfNoCase User-Agent "^NPbot" bad_bot
SetEnvIfNoCase User-Agent "^Octopus" bad_bot
SetEnvIfNoCase User-Agent "^Offline\ Explorer" bad_bot
SetEnvIfNoCase User-Agent "^Offline\ Navigator" bad_bot
SetEnvIfNoCase User-Agent "^Openfind" bad_bot
SetEnvIfNoCase User-Agent "^OutfoxBot" bad_bot
SetEnvIfNoCase User-Agent "^PageGrabber" bad_bot
SetEnvIfNoCase User-Agent "^Papa\ Foto" bad_bot
SetEnvIfNoCase User-Agent "^pavuk" bad_bot
SetEnvIfNoCase User-Agent "^pcBrowser" bad_bot
SetEnvIfNoCase User-Agent "^PHP\ version\ tracker" bad_bot
SetEnvIfNoCase User-Agent "^Pockey" bad_bot
SetEnvIfNoCase User-Agent "^ProPowerBot/2.14" bad_bot
SetEnvIfNoCase User-Agent "^ProWebWalker" bad_bot
SetEnvIfNoCase User-Agent "^psbot" bad_bot
SetEnvIfNoCase User-Agent "^Pump" bad_bot
SetEnvIfNoCase User-Agent "^QueryN.Metasearch" bad_bot
SetEnvIfNoCase User-Agent "^RealDownload" bad_bot
SetEnvIfNoCase User-Agent "Reaper" bad_bot
SetEnvIfNoCase User-Agent "Recorder" bad_bot
SetEnvIfNoCase User-Agent "^ReGet" bad_bot
SetEnvIfNoCase User-Agent "^RepoMonkey" bad_bot
SetEnvIfNoCase User-Agent "^RMA" bad_bot
SetEnvIfNoCase User-Agent "Siphon" bad_bot
SetEnvIfNoCase User-Agent "sitecheck.internetseer.com" bad_bot
SetEnvIfNoCase User-Agent "^SiteSnagger" bad_bot
SetEnvIfNoCase User-Agent "^SlySearch" bad_bot
SetEnvIfNoCase User-Agent "^SmartDownload" bad_bot
SetEnvIfNoCase User-Agent "^Snake" bad_bot
SetEnvIfNoCase User-Agent "^Snapbot" bad_bot
SetEnvIfNoCase User-Agent "^Snoopy" bad_bot
SetEnvIfNoCase User-Agent "^sogou" bad_bot
SetEnvIfNoCase User-Agent "^SpaceBison" bad_bot
SetEnvIfNoCase User-Agent "^SpankBot" bad_bot
SetEnvIfNoCase User-Agent "^spanner" bad_bot
SetEnvIfNoCase User-Agent "^Sqworm" bad_bot
SetEnvIfNoCase User-Agent "Stripper" bad_bot
SetEnvIfNoCase User-Agent "Sucker" bad_bot
SetEnvIfNoCase User-Agent "^SuperBot" bad_bot
SetEnvIfNoCase User-Agent "^SuperHTTP" bad_bot
SetEnvIfNoCase User-Agent "^Surfbot" bad_bot
SetEnvIfNoCase User-Agent "^suzuran" bad_bot
SetEnvIfNoCase User-Agent "^Szukacz/1.4" bad_bot
SetEnvIfNoCase User-Agent "^tAkeOut" bad_bot
SetEnvIfNoCase User-Agent "^Teleport" bad_bot
SetEnvIfNoCase User-Agent "^Telesoft" bad_bot
SetEnvIfNoCase User-Agent "^TurnitinBot/1.5" bad_bot
SetEnvIfNoCase User-Agent "^The.Intraformant" bad_bot
SetEnvIfNoCase User-Agent "^TheNomad" bad_bot
SetEnvIfNoCase User-Agent "^TightTwatBot" bad_bot
SetEnvIfNoCase User-Agent "^Titan" bad_bot
SetEnvIfNoCase User-Agent "^toCrawl/UrlDispatcher" bad_bot
SetEnvIfNoCase User-Agent "^True_Robot" bad_bot
SetEnvIfNoCase User-Agent "^turingos" bad_bot
SetEnvIfNoCase User-Agent "^TurnitinBot" bad_bot
SetEnvIfNoCase User-Agent "^URLy.Warning" bad_bot
SetEnvIfNoCase User-Agent "^Vacuum" bad_bot
SetEnvIfNoCase User-Agent "^VCI" bad_bot
SetEnvIfNoCase User-Agent "^VoidEYE" bad_bot
SetEnvIfNoCase User-Agent "^Web\ Image\ Collector" bad_bot
SetEnvIfNoCase User-Agent "^Web\ Sucker" bad_bot
SetEnvIfNoCase User-Agent "^WebAuto" bad_bot
SetEnvIfNoCase User-Agent "^WebBandit" bad_bot
SetEnvIfNoCase User-Agent "^Webclipping.com" bad_bot
SetEnvIfNoCase User-Agent "^WebCopier" bad_bot
SetEnvIfNoCase User-Agent "^WebEMailExtrac.*" bad_bot
SetEnvIfNoCase User-Agent "^WebEnhancer" bad_bot
SetEnvIfNoCase User-Agent "^WebFetch" bad_bot
SetEnvIfNoCase User-Agent "^WebGo\ IS" bad_bot
SetEnvIfNoCase User-Agent "^Web.Image.Collector" bad_bot
SetEnvIfNoCase User-Agent "^WebLeacher" bad_bot
SetEnvIfNoCase User-Agent "^WebmasterWorldForumBot" bad_bot
SetEnvIfNoCase User-Agent "^WebReaper" bad_bot
SetEnvIfNoCase User-Agent "^WebSauger" bad_bot
SetEnvIfNoCase User-Agent "^WebSite" bad_bot
SetEnvIfNoCase User-Agent "^Website\ eXtractor" bad_bot
SetEnvIfNoCase User-Agent "^Website\ Quester" bad_bot
SetEnvIfNoCase User-Agent "^Webster" bad_bot
SetEnvIfNoCase User-Agent "^WebStripper" bad_bot
SetEnvIfNoCase User-Agent "^WebWhacker" bad_bot
SetEnvIfNoCase User-Agent "^WebZIP" bad_bot
SetEnvIfNoCase User-Agent "^Wget" bad_bot
SetEnvIfNoCase User-Agent "Whacker" bad_bot
SetEnvIfNoCase User-Agent "^Widow" bad_bot
SetEnvIfNoCase User-Agent "^WISENutbot" bad_bot
SetEnvIfNoCase User-Agent "^WWWOFFLE" bad_bot
SetEnvIfNoCase User-Agent "^WWW-Collector-E" bad_bot
SetEnvIfNoCase User-Agent "^Xaldon" bad_bot
SetEnvIfNoCase User-Agent "^Xenu" bad_bot
SetEnvIfNoCase User-Agent "^Zeus" bad_bot
SetEnvIfNoCase User-Agent "^Zyborg" bad_bot

<Limit GET POST HEAD>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Limit>
## .htaccess Code :: END

Just insert into your .htaccess file and you'll be in great shape. Let me know if you have updates or questions.

 

jdMorgan




msg:3063894
 3:46 am on Aug 29, 2006 (gmt 0)

I would suggest a <Files *> container, rather than a <Limit> container, unless it is your intent to allow these unwelcome user-agents to make PUT, DELETE, CONNECT, OPTIONS, PATCH, PROPFIND, PROPPATCH, MKCOL, COPY, MOVE, LOCK, and UNLOCK requests to your site.

If it is your intent to allow these other methods, then <Limit GET POST> is sufficient; as documented, specifying GET implies that HEAD requests are also allowed.

The length of this code can be reduced greatly if desired by combining user-agents. For example, denying from "e-?mail" and "download" using SetEnvIfNoCase will cover many undesirables all at once.
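Jim's combining idea might look something like the following sketch. The patterns here are illustrative assumptions, not a tested list -- check them against your own logs before deploying:

```apache
## Illustrative sketch of combining user-agents (assumed patterns):
## one unanchored, case-insensitive pattern can replace many single-UA lines.
SetEnvIfNoCase User-Agent "e-?mail" bad_bot
# covers EmailCollector, EmailSiphon, EmailWolf, WebEMailExtrac, ...
SetEnvIfNoCase User-Agent "download" bad_bot
# covers Download Demon, Download Devil, Download Wonder, Downloader, ...
```

Because the patterns are unanchored and case-insensitive, each line matches the substring anywhere in the User-Agent string, which is what makes the consolidation possible.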

I always suggest that such lists be trimmed to include only those user-agents which are actually problems for your site, plus any newly-reported ones that *may* become a problem for your site. As WebmasterWorld member IncrediBill points out, the problem with such lists is that they grow forever if not trimmed, and you are always playing catch-up to new "bad" user-agents. And if your site is growing in popularity, the length of the list and the number of requests to your site will eventually result in a noticeable performance penalty.

For sites already experiencing that problem, a combined whitelist/blacklist approach often results in smaller overall code size and better performance.
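One hypothetical shape for such a combined whitelist/blacklist approach is sketched below. The agent names and patterns are placeholders only; the real policy has to come from your own logs:

```apache
## Hypothetical whitelist/blacklist sketch -- NOT a drop-in policy.
## Mark the handful of agents you explicitly trust...
SetEnvIfNoCase User-Agent "Googlebot|Slurp|msnbot" good_bot
## ...then mark broad suspicious classes with a few wide patterns.
SetEnvIfNoCase User-Agent "(download|grab|suck|leech)" bad_bot

<Files *>
# With Order Deny,Allow, a request matching both lists is allowed:
# the Allow (whitelist) overrides the Deny (blacklist).
Order Deny,Allow
Deny from env=bad_bot
Allow from env=good_bot
</Files>
```

Requests matching neither list fall through to the default of Deny,Allow, which is to allow, so ordinary visitors are unaffected.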

These comments are intended to provide a bit of background info, and not to detract from the work that went into compiling this list of user-agents!

Jim

hybrid6studios




msg:3065180
 1:30 am on Aug 30, 2006 (gmt 0)

JD, thanks! Any improvements are always welcome...that's why I put it out here. It all started with work others did; I'm merely contributing. I will definitely put that into practice. Anyone else, please add your comments as well. Thanks!

jackhandy




msg:3065420
 7:54 am on Aug 30, 2006 (gmt 0)

Could someone please elaborate again on this topic? I spent a few hours reading the original thread and was going to implement bad_bot.pl, but I can't find it anywhere on this site. That thread ended in 2004.

What happened to doing this with mod_rewrite? I got to the end of the original thread and was lost all over again. What is actually the best way of doing this?

If I have a PHP member program protecting directories containing content, would these "bad bots" still be able to download my site? There are too many unanswered questions to add all this to my .htaccess file without more information, but I would like to KNOW if someone is attempting to steal my site.

LunaC




msg:3065872
 2:55 pm on Aug 30, 2006 (gmt 0)

What exactly does ^ mean, and why do some have that and others don't?

Does ^badguy = [anythinghere]badguy?
If so, is there anytime that would be bad to have? (Sorry, but searching anywhere for the definition of ^ doesn't exactly bring an answer.)

This is a better way to write for a more complete block?

<Files *>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Files>

jdMorgan




msg:3065886
 3:02 pm on Aug 30, 2006 (gmt 0)

"^" is a 'start anchor.' Patterns with a start anchor will only match if the string begins with the specified pattern. Do not use any mod_rewrite code until you understand it completely -- To do so is to leave yourself depending on others to find and fix code problems which may cause major trouble on your site.
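To illustrate the anchor using two entries from the list above (the UA strings in the comments are hypothetical examples):

```apache
## "^Wget" is anchored: it matches a UA string that STARTS with "Wget",
## e.g. "Wget/1.10.2", but not "SomeProxy (compatible; Wget inside)".
SetEnvIfNoCase User-Agent "^Wget" bad_bot
## "Bandit" is unanchored: it matches "Bandit" ANYWHERE in the string,
## e.g. both "Bandit/3.0" and "WebBandit/4.0".
SetEnvIfNoCase User-Agent "Bandit" bad_bot
```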

See the tutorials listed in our forum charter [webmasterworld.com], especially, the one on Regular Expressions.

Jim

jbgilbert




msg:3065960
 3:37 pm on Aug 30, 2006 (gmt 0)

Just to make sure:

Is this the proper structure for "files" to replace the limit block?

<Files *>
Order Allow,Deny
Allow from all
Deny from env=bad_bot
</Files>

LunaC




msg:3067409
 2:38 pm on Aug 31, 2006 (gmt 0)

Ah, thanks. That tutorial is exactly what I'd been hoping to find.

hybrid6studios




msg:3067418
 2:48 pm on Aug 31, 2006 (gmt 0)

This is excellent! I used the <Files *> container and it's working perfectly so far, and I'm working on paring down the code.

I found a separate problem...I downloaded Offline Explorer and used it on my site to see how it worked. They've gotten smart with recent versions: they don't send their own User-Agent by default anymore...they identify as IE. When I switched it to the Offline Explorer UA, my .htaccess worked and blocked it, but that's not a real-life scenario...normally it will be set to IE and no one will change it. So that's a problem...any ideas?

cla313




msg:3649910
 5:09 pm on May 14, 2008 (gmt 0)

I have a few question:
1) Do you need the <Files *>...</Files> container, or is it also possible to have the list above followed directly by the directives, without the Files tags?
Example:

SetEnvIfNoCase ...
SetEnvIfNoCase User-Agent "^Black.Hole" bad_bot
SetEnvIfNoCase User-Agent "^Java" bad_bot
SetEnvIfNoCase User-Agent "^Jakarta" bad_bot
SetEnvIfNoCase ...
Order Allow,Deny
Allow from all
Deny from env=bad_bot

2) My awstats reports also show "Acrobat" as a grabber; is it advisable to add it to the list, and if so, how? I fear it will no longer be possible to view PDF files from my site if I do that.

3) Elsewhere it is recommended also to block empty user-agents. How do I do that with SetEnvIfNoCase? I tried
SetEnvIfNoCase User-Agent "" bad_bot
but I got a 500 error...

Thanks in advance!

[edited by: jdMorgan at 5:42 pm (utc) on May 14, 2008]
[edit reason] No URLs, please. [/edit]

cla313




msg:3649940
 5:37 pm on May 14, 2008 (gmt 0)

In fact I have one more question, related to my question number 3) above.
How to block with SetEnvIfNoCase the following:
a) empty user agent
b) layered technologies, and other bots in the list of Johann Burkard which are not listed above?

I also wonder why hybrid6studios adds a \ before a space, as in "Indy\ Library". Would "Indy Library" not be effective?

Thanks!

jdMorgan




msg:3650960
 3:36 pm on May 15, 2008 (gmt 0)

1) You can omit <Files *>

2) Never heard of it. Many of the user-agents in the threads here are obsolete and no longer active. I suggest you go through your stats, and include blocking lines only for those user-agents that are actually a problem for your site.

3) Empty user-agent:

SetEnvIf User-Agent "^$" bad_bot

"Fake" empty user-agent (a literal "-"):

SetEnvIf User-Agent "^-$" bad_bot

Both:

SetEnvIf User-Agent "^-?$" bad_bot

Layered and others can be blocked by IP address

One IP address:
Deny from 12.34.56.78

or 256 IP addresses 12.34.56.0 - 12.34.56.255
Deny from 12.34.56

or
Deny from 12.34.56.0/24

or
Deny from 12.34.56.0/255.255.255.0

(that's an example IP address/address range only)

You could also use reverse-DNS if available on your server, but some servers will then start logging all accesses using hostnames, and reverse-DNS checks are slow and introduce an additional point of failure -- If your server cannot successfully complete a request to the DNS server, the user's request will 'hang'.

The format for that would be:
Deny from example.com

See mod_access and mod_setenvif docs for more details.

Jim

cla313




msg:3651925
 4:07 pm on May 16, 2008 (gmt 0)

Thank you Jim,
you have been most helpful!

Is this an appropriate thread to ask other people if they know about "potential" spammers (useragents) which do not appear in the list above? or do you know of another thread at webmasterworld where I can ask for feedback on that?

Thanks again!

wilderness




msg:3651962
 4:49 pm on May 16, 2008 (gmt 0)

These lines have been copied and pasted around for years, and many of the UAs are no longer seen these days.

There is a great deal of redundancy and repetition in these lines.

The same applies to the repeated use of QUOTES in each line; none of these patterns contains a space, so the quotes add nothing.
EX:

"^CherryPicker"

and the unquoted

^CherryPicker

both mean BEGINS with CherryPicker, so the unquoted form is all that's needed. See the later examples for condensing lines.

#condenses two lines
SetEnvIfNoCase User-Agent ^BackWeb bad_bot
SetEnvIfNoCase User-Agent ^Black bad_bot
SetEnvIf User-Agent Grab bad_bot
SetEnvIfNoCase User-Agent ^Info bad_bot
SetEnvIfNoCase User-Agent lwp bad_bot
SetEnvIfNoCase User-Agent ^Pro bad_bot
SetEnvIfNoCase User-Agent site bad_bot

#condenses three lines
SetEnvIfNoCase User-Agent ^Email bad_bot
SetEnvIfNoCase User-Agent ^Get bad_bot
SetEnvIfNoCase User-Agent ^Link bad_bot

#condenses five lines
SetEnvIfNoCase User-Agent ^Download bad_bot

#condenses six lines
SetEnvIfNoCase User-Agent ^Net bad_bot

#condenses twenty-two lines
SetEnvIfNoCase User-Agent ^Web bad_bot

There are likely more ways to condense lines, however I've provided enough examples to this rebirth ;)

wilderness




msg:3651998
 5:23 pm on May 16, 2008 (gmt 0)

2) My awstats reports show also "Acrobat" as a grabber; is it recommendable to add it to the list and if so, how? I fear that it will not be possible to view pdf files from my site if I do that.

The full version of Acrobat/Adobe offers a "print" option to PDF which will spider and retrieve entire websites and outbound linked pages (if desired) as well.

The UA escapes me at the moment and I have too much going on presently.
Will provide the UA later on.

jdMorgan




msg:3652002
 5:25 pm on May 16, 2008 (gmt 0)

The reason I stated above that it's a waste of (CPU) time to parse this whole list unless these UAs actually appear in your logs is that most of the scrapers and harvesters have moved on, and are now using UA strings that look just like a legitimate browser. Some of the agents in the list are still active, but not very, and not many. In short, they've discovered it's not a real good idea to identify yourself as a scraper/harvester any more...

In order to detect problematic accesses these days, you need far more sophisticated methods, some of which are "paid options" you can install on your server. Some intermediate solutions are available here at WebmasterWorld in the PERL and PHP forum libraries, generally findable using "bad bot" and "runaway bots" in the WebmasterWorld forum library search facility.

Jim

cla313




msg:3652004
 5:28 pm on May 16, 2008 (gmt 0)

Thanks to both of you!
I will definitely look at those too!
Is it OK if I submit strange agents found in my logs?

wilderness




msg:3652089
 7:19 pm on May 16, 2008 (gmt 0)

"WebCapture" is the UA (or at least a portion) used by the full version of Adobe to create PDF's of a website.

EarleyGirl




msg:3691697
 6:34 pm on Jul 6, 2008 (gmt 0)

You can omit <Files *>

This thread has me wondering about a few things. Are the results the same when omitting <Files *> for denying IPs?

Also, I have the following in my .htaccess and wonder if I'm doing it correctly:
<Files ~ "example.php">
Order allow,deny
Deny from all
</Files>

<Files ~ ".htaccess">
Order allow,deny
Deny from all
</Files>

order allow,deny
deny from example.com
allow from all

Regarding upper/lowercase is the following recommended or is there no difference?
Order allow,deny
Deny from example.com
Allow from all

Thanks,
EG
