Each iteration of this list gets longer, and nobody ever bothers to remove obsolete entries. Last year incrediBill started a thread for the default UAs of programming libraries at [webmasterworld.com...] It's a shame a similar thread was never started for bad bots.
I'm working on testing my bad bot UA strings against a sampling of my server logs representing 10.6 GB of data over 83 days, to find which strings were actually used. Even though it is a small sampling of days, it is still a huge amount of data to search, so it will take considerable time to test every entry in my bad bot list. Once that's done I will share my condensed list, but I also hope others will help fill in the gaps with the most active bad bot UAs they see.
It would also be good to have a discussion about which .htaccess methods are truly the fastest.
A while back I saw comments here on Webmaster World promoting the use of combined regular expressions to reduce the number of .htaccess entries. For instance:
RewriteCond %{HTTP_USER_AGENT} (efp@gmx\.net|hhjhj@yahoo\.com|lerly\.net|mapfeatures\.net|metacarta\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Industry|Internet|IUFW|Lincoln|Missouri|Program).?(Program|Explore|Web|State|College|Shareware) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Mac|Ram|Educate|WEP).?(Finder|Search) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Moz+illa|MSIE).?[0-9]?.?[0-9]?[0-9]?$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9][0-9]?.\(compatible[\)\ ] [NC,OR]
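For anyone copying conditions like these, remember that RewriteCond lines do nothing on their own; they only take effect when followed by a RewriteRule. A minimal sketch, using made-up placeholder patterns rather than real UAs:
RewriteEngine On
# conditions are ORed together; the last one drops the OR flag
RewriteCond %{HTTP_USER_AGENT} (placeholder-bot-one|placeholder-bot-two) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(another|prefix) [NC]
# return 403 Forbidden to any request whose UA matched above
RewriteRule .* - [F]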
I tried this method to condense my bad bot list and I found it actually increased my response times. I thought that maybe if I started a line with a fixed string before the regular expression it would be more efficient. For example:
RewriteCond %{HTTP_USER_AGENT} ^webpage(widget|downloader|scrapper|harvester) [NC,OR])
It is a total pain to do, but cleaning out obsolete .htaccess entries can really improve website performance, and Google is actively promoting faster-loading webpages with its "Page Speed" tools. It is probably only a matter of time before Google factors how quickly pages load into its SERP calculations, so cleaning up our bad bot countermeasures and finding ways to optimize our .htaccess files is probably a good idea.
gets longer, and nobody ever bothers to remove obsolete entries
Unless the policies at Webmaster World have changed in the last few moments ;)
Participants are not allowed to edit submissions beyond a very short time frame, or at all once another participant submits a follow-up reply.
Volunteer moderators are simply too short on time to go back and edit old submissions manually, nor would most moderators want to get involved in that capacity.
That old "Close to Perfect .htaccess" thread was extended by two follow-up threads [webmasterworld.com]
Many contributors to those threads provided bloated and even incorrect code, the latter of which was never corrected in the original submissions and remains there (still incorrect) today.
I tried this method to condense my bad bot list and I found it actually increased my response times. I thought that maybe if I started a line with a fixed string before the regular expression it would be more efficient. For example:
RewriteCond %{HTTP_USER_AGENT} ^webpage(widget|downloader|scrapper|harvester) [NC,OR])
This line has some real issues.
1) WHY the trailing parenthesis after [NC,OR]?
2) Your Rewrite essentially reads as follows:
The User-Agent BEGINS with the word webpage and is followed immediately (no intervening space) by ANY of the words enclosed in the parentheses.
2a) EX: one such UA would read (at the beginning) webpagewidget, which is MORE THAN unlikely to be what you believe you're attempting to catch.
2b) No idea what you're attempting to catch with that line.
Perhaps you could expand on what exactly you're attempting to do, and maybe even provide an existing UA that fits this criterion?
Don
Unless the policies at Webmaster World have changed in the last few moments ;)
Participants are not allowed to edit submissions beyond a very short time frame, or at all once another participant submits a follow-up reply.
I was thinking more of a new thread where we start off with a short .htaccess list (which I'm working on) and people then submit entries they ACTUALLY found in their server logs recently. We could periodically collect those entries into an updated master list posted as a new reply in the thread.
I was thinking this would be more suited to a group effort in Google Wave, but I'm sure we could make this work. ;)
This line has some real issues.
1) WHY the trailing parenthesis after [NC,OR]?
2) Your Rewrite essentially reads as follows:
The User-Agent BEGINS with the word webpage and is followed immediately (no intervening space) by ANY of the words enclosed in the parentheses.
2a) EX: one such UA would read (at the beginning) webpagewidget, which is MORE THAN unlikely to be what you believe you're attempting to catch.
2b) No idea what you're attempting to catch with that line.
Next time I'll use ^widget.?(blue|red|green) [NC,OR] ;)
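More seriously, a corrected version of that line (dropping the stray parenthesis and leaving the optional-separator dot in place) would read something like the following, and it still needs the usual closing RewriteRule to actually block anything:
RewriteCond %{HTTP_USER_AGENT} ^webpage.?(widget|downloader|scrapper|harvester) [NC,OR]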
[webmasterworld.com...]
The problem with these things is a lack of conformity (as with the old Close to Perfect threads), especially when some dweeb comes along and inserts their entire version of an .htaccess file, regardless of whether the file even works.
The repetition and length of such submissions turn off most regular participants.
I'm currently using a whitelist of user-agents with full validation on all aspects of their requests, IP-address range blacklists, and several other techniques. These are all the UAs that currently remain on my blacklist:
Babya\ Discoverer
curl
EmailSiphon
^goof
grub-client
heritrix/
IECheck
Indy\ Library
Jakarta\ Commons-HttpClient
larbin
libidn
LWP::(Simple|trivial)
Microsoft\ URL\ Control
Mozilla/[0-9.]+\ (ips-agent|\ Beta\ \(Windows\)|VB\ Project|Site\ Server|Viewzi)
Nutch
^Toata\ dragostea
Plesk
POE-Component-Client-HTTP
Wget
^-$ (literally, a hyphen)
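For anyone who wants to drop fragments like these into mod_rewrite, each one simply becomes its own RewriteCond line; a minimal sketch using a few of the entries above (add [NC] to any line you want matched case-insensitively) looks like this:
RewriteCond %{HTTP_USER_AGENT} ^goof [OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [OR]
RewriteCond %{HTTP_USER_AGENT} Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^-$
RewriteRule .* - [F]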
Your best source for UAs to block is your own server UA stats report; there's little use blocking a user-agent that never visits, or that doesn't visit often enough to constitute a real problem (as you define that term).
Jim
This method of access control is practically obsolete now; the worst scrapers and spammers have moved on, and most use 'real' browser User-Agent strings. Aside from Wget and Indy Library, there are very few of these UAs left 'in the wild'.
I'm in a shared hosting environment, and my web host hasn't implemented any throttling mechanism I can use to limit how many requests an individual visitor can make over a given period of time.
My best hope seems to be a robust IP block list that targets server farms and a selective UA block list.
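For the IP-range side, a plain .htaccess deny works even on shared hosting (assuming the host permits Limit overrides); a minimal sketch, using the reserved documentation netblock 192.0.2.0/24 purely as a placeholder for a real server-farm range:
# deny one data-center netblock, allow everyone else
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24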
My methodology was to select a representative sample of logs from my website covering 85 days, with a total uncompressed file size of 10.6 GB. I put all of the selected log files into their own folder on my computer and appended ".txt" to the file names so that Windows would search inside them. I then searched the files for each UA string. Where matches turned up, I opened a sampling of the entries to verify the strings and to look for IP ranges I could block. I also added new UA strings not found on the lists above, based on what I was finding in my logs as I went along.
My hope is that others will post some of the UA strings they block that are actively hitting their servers. I would also hope that folks resist the urge to post monolithic lists that have not been purged of inactive UA strings.
# PREVENT PREFETCHING OF PAGES
#=====================================
RewriteCond %{HTTP:X-moz} ^prefetch [NC,OR]

# BLOCK DEFAULT UA OF PROGRAMMING LIBRARIES
#==================================================
RewriteCond %{HTTP_USER_AGENT} ^curl/ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTMLParser [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Jakarta\ Commons [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^libcurl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^libwww-perl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LWP::Simple [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^lwp-request [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ Data\ Access [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MS\ Web\ Services\ Client\ Protocol [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PECL::HTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^POE-Component-Client-HTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PycURL [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Python-urllib [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Snoopy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^VB\ Project [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWW::Mechanize [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} RPT-HTTPClient [NC,OR]

# BLOCK BAD BOTS, ETC. - VERIFIED IN LOGS 2009-12
#==================================================
RewriteCond %{HTTP_USER_AGENT} ^$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^AISearchBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^al_viewer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^amibot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BDFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^core-project [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Cuam\ Ver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DoubleVerify\ Crawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/(2|3)\.0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/([1-9])\.([0-9])\ http [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/([1-9])\.([0-9])$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Ruby [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SBL-BOT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Space\ Bison [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Squid-Prefetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Twisted\ PageGetter [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebImages [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^YebolBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zend_Http_Client [NC,OR]
RewriteCond %{HTTP_USER_AGENT} 80legs [NC,OR]
RewriteCond %{HTTP_USER_AGENT} aiHitBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Arachmo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} asynchttp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DreamPassport [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Email [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Exabot-Thumbnails [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Extractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Fetch\ API\ Request [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MS\ FrontPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MSFrontPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MSIECrawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MyDiGiRabi [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NEWT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ppclabs_bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SimulBrowse [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SpiderMan [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Spinn3r [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCapture [NC,OR]
RewriteCond %{HTTP_USER_AGENT} webcollage [NC,OR]

# BLOCK BAD BOTS - USED BUT NOT FOUND IN LOGS SAMPLED 2009-12
#==================================================
RewriteCond %{HTTP_USER_AGENT} ^DittoSpyder [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC]

# BLOCK BAD BOTS - EXECUTE RULE
#==================================================
RewriteRule !^(robots\.txt|feed\.xml)$ - [F,L]
Note that the final instruction is intended to allow the blocked bots to access both the robots.txt file AND my RSS feed, but nothing else.
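One caveat for anyone copying the block wholesale: it assumes the rewrite engine has already been switched on earlier in the same .htaccess, i.e. something like:
# assumed to appear near the top of the .htaccess file
RewriteEngine On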
Seeing this thread was a reminder that I need to continue to optimize. It's hard to let go of things that once gave that feeling of protection, but alas, many of these bad boys just do not look like this any longer.
I've got the file size of my .htaccess down to 9 KB now, and better yet, the number of rules being processed has been reduced by 60%. I see a noticeable benefit in page load time.