Updating bot ban list & cleaning out obsolete entries


KenB - 1:37 am on Dec 23, 2009 (gmt 0)


Okay, here is my updated .htaccess UA ban list. It is based on the classic list found at [webmasterworld.com...] and the list of default UA strings for programming libraries found at [webmasterworld.com...]

My methodology was to select a representative sample of logs from my website covering 85 days, with a total uncompressed size of 10.6 GB. I put the selected log files into their own folder on my computer and appended ".txt" to the file names so that Windows would search inside them. I then used Windows search to look for each UA string. When a search returned matches, I opened a sample of the matching logs to verify the string and to look for IP ranges I could block. I also added new UA strings, not on the lists above, that I found in my logs as I went along.
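For anyone who would rather script the search than run it by hand in Windows, here is a rough sketch of the same idea in Python. It is not what I actually used. It assumes the logs are in Apache "combined" format (user agent as the last quoted field), and the names `count_ua_hits` and `iter_log_lines` are made up for illustration.

```python
import re
from collections import Counter
from pathlib import Path

# Assumption: Apache "combined" log format, where the user agent is the
# last double-quoted field on each line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def iter_log_lines(folder):
    """Yield every line from every file in `folder` (hypothetical helper)."""
    for path in sorted(Path(folder).iterdir()):
        if path.is_file():
            with path.open(encoding="utf-8", errors="replace") as handle:
                for line in handle:
                    yield line.rstrip("\n")


def count_ua_hits(lines, patterns):
    """Count log lines whose user agent matches each regex in `patterns`.

    Matching is case-insensitive, like the [NC] flag on the RewriteCond lines.
    """
    compiled = {p: re.compile(p, re.IGNORECASE) for p in patterns}
    hits = Counter()
    for line in lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line is not in the assumed format
        user_agent = match.group(1)
        for pattern, regex in compiled.items():
            if regex.search(user_agent):
                hits[pattern] += 1
    return hits
```

Something like `count_ua_hits(iter_log_lines("logs"), ["^Wget", "^libwww-perl"])` would then report how often each string shows up. A pattern with a count of zero is a candidate for the "not found in logs" section.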

My hope is that others will post some of the UA strings they are blocking that are actively hitting their servers. I would also hope that folks resist the urge to post monolithic lists that have not been cleaned of inactive UA strings.

# PREVENT PREFETCHING OF PAGES
#=====================================
RewriteCond %{HTTP:X-moz} ^prefetch [NC,OR]

# BLOCK DEFAULT UA OF PROGRAMMING LIBRARIES
#==================================================
RewriteCond %{HTTP_USER_AGENT} ^curl/ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTMLParser [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Jakarta\ Commons [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^libcurl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^libwww-perl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LWP::Simple [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^lwp-request [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ Data\ Access [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MS\ Web\ Services\ Client\ Protocol [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PECL::HTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^POE-Component-Client-HTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PycURL [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Python-urllib [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Snoopy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^VB\ Project [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWW::Mechanize [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} RPT-HTTPClient [NC,OR]

# BLOCK BAD BOTS, ETC. - VERIFIED IN LOGS 2009-12
#==================================================
RewriteCond %{HTTP_USER_AGENT} ^$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^AISearchBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^al_viewer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^amibot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BDFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^core-project [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Cuam\ Ver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DoubleVerify\ Crawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/(2|3)\.0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/([1-9])\.([0-9])\ http [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/([1-9])\.([0-9])$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Ruby [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SBL-BOT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Space\ Bison [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Squid-Prefetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Twisted\ PageGetter [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebImages [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^YebolBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zend_Http_Client [NC,OR]
RewriteCond %{HTTP_USER_AGENT} 80legs [NC,OR]
RewriteCond %{HTTP_USER_AGENT} aiHitBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Arachmo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} asynchttp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DreamPassport [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Email [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Exabot-Thumbnails [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Extractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Fetch\ API\ Request [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MS\ FrontPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MSFrontPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MSIECrawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MyDiGiRabi [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NEWT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ppclabs_bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SimulBrowse [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SpiderMan [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Spinn3r [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCapture [NC,OR]
RewriteCond %{HTTP_USER_AGENT} webcollage [NC,OR]

# BLOCK BAD BOTS - USED BUT NOT FOUND IN LOGS SAMPLED 2009-12
#==================================================
RewriteCond %{HTTP_USER_AGENT} ^DittoSpyder [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC]

# BLOCK BAD BOTS - EXECUTE RULE
#==================================================
RewriteRule !^(robots\.txt|feed\.xml)$ - [F,L]

Note that the final rule is intended to let the blocked bots access both the robots.txt file and my RSS feed, but nothing else.
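To make the logic of that last rule easier to reason about, here is a small Python sketch that mimics the decision: a request is refused only when the UA matches one of the conditions and the path is not one of the two exempt files. It only illustrates the intent and does not replace testing on a real server. `BLOCKED_UA` holds just a handful of the patterns from the list, and `is_forbidden` is a made-up name.

```python
import re

# Illustrative subset of the user-agent conditions from the list above;
# re.IGNORECASE stands in for the [NC] flag.
BLOCKED_UA = [
    re.compile(p, re.IGNORECASE)
    for p in (r"^$", r"^Wget", r"^libwww-perl", r"HTTrack")
]

# The two files the RewriteRule exempts. In .htaccess context the pattern is
# matched against the path without its leading slash.
EXEMPT_PATH = re.compile(r"^(robots\.txt|feed\.xml)$")


def is_forbidden(user_agent, path):
    """Return True if the request would get a 403 under this sketch."""
    ua_blocked = any(regex.search(user_agent) for regex in BLOCKED_UA)
    path_exempt = EXEMPT_PATH.search(path.lstrip("/")) is not None
    return ua_blocked and not path_exempt
```

With this sketch, `is_forbidden("Wget/1.11.4", "/index.html")` is true while `is_forbidden("Wget/1.11.4", "/robots.txt")` is false. A file with the same name in a subfolder, such as "/blog/feed.xml", is not exempt because the pattern is anchored at both ends.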


Thread source: http://www.webmasterworld.com/search_engine_spiders/4046696.htm