Updating bot ban list & cleaning out obsolete entries
KenB
msg:4046698
1:21 am on Dec 21, 2009 (gmt 0)

I'm sure most of us are familiar with the classic bad bot ban list for .htaccess that gets copied and pasted wholesale from web developer forum to forum (e.g.: [webmasterworld.com...] ).

Each iteration of this list gets longer, and nobody ever bothers to remove obsolete entries. Last year incrediBILL started a thread for the default UAs of programming libraries at [webmasterworld.com...] It's a shame a similar thread never got started for bad bots.

I'm working on testing my bad bot UA strings against a sampling of my server logs, 10.6 GB of data covering 83 days, to find which strings were actually used. Even though it is a small sampling of days, it is still a huge amount of data to search, so it will take considerable time to test every entry in my bad bot list. Once finished I will share my condensed list, but I also hope others will help fill in gaps for the most active bad bot UAs.

It would also be good to have a discussion about what .htaccess methods are truly the fastest.

I saw some comments from some time back here on WebmasterWorld where individuals were promoting the use of regular expressions to reduce the number of .htaccess entries. For instance:
RewriteCond %{HTTP_USER_AGENT} (efp@gmx\.net|hhjhj@yahoo\.com|lerly\.net|mapfeatures\.net|metacarta\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Industry|Internet|IUFW|Lincoln|Missouri|Program).?(Program|Explore|Web|State|College|Shareware) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Mac|Ram|Educate|WEP).?(Finder|Search) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Moz+illa|MSIE).?[0-9]?.?[0-9]?[0-9]?$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9][0-9]?.\(compatible[\)\ ] [NC,OR]

I tried this method to condense my bad bot list and I found it actually increased my response times. I thought that maybe if I started a line with a fixed string before the regular expression it would be more efficient. For example:
RewriteCond %{HTTP_USER_AGENT} ^webpage(widget|downloader|scrapper|harvester) [NC,OR])

My thinking was that Apache would quickly bail on the line and go to the next one if the first character of the UA string didn't match, but this method still slowed response times compared to having each bot on its own line.
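
To be concrete about what I was comparing, the two layouts look roughly like this (FakeBot is a made-up name, so please don't add it to anything):

# One bot per line:
RewriteCond %{HTTP_USER_AGENT} ^FakeBotOne [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FakeBotTwo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^FakeBotThree [NC,OR]

# Condensed into one alternation with a fixed-string prefix:
RewriteCond %{HTTP_USER_AGENT} ^FakeBot(One|Two|Three) [NC,OR]

Both fragments assume a RewriteRule follows them, as in the examples above; in my testing the one-bot-per-line layout consistently responded faster.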

It is a total pain to do, but cleaning out obsolete .htaccess entries can really improve website performance, and Google is promoting faster-loading webpages with its "Page Speed" tools. It is probably only a matter of time before Google factors how quickly pages load into its SERP calculations. So cleaning up our bad bot countermeasures and finding ways to optimize our .htaccess files is probably a good idea.

 

wilderness
msg:4047098
6:05 pm on Dec 21, 2009 (gmt 0)

gets longer, and nobody ever bothers to remove obsolete entries

Unless the policies at WebmasterWorld have changed in the last few moments ;)

Participants are only allowed to edit submissions within a very short time frame, and not at all once another participant has submitted a reply.

Volunteer moderators are simply too short on time to go back and edit old submissions manually. Nor would most moderators want to become involved in that capacity.

That old Close to Perfect Htaccess thread was extended by two follow-up threads [webmasterworld.com]

Many contributors to these threads provided bloated and even incorrect coding, the latter of which was never edited in the original submission and remains unedited (and incorrect) today.

wilderness
msg:4047103
6:16 pm on Dec 21, 2009 (gmt 0)

I tried this method to condense my bad bot list and I found it actually increased my response times. I thought that maybe if I started a line with a fixed string before the regular expression it would be more efficient. For example:
RewriteCond %{HTTP_USER_AGENT} ^webpage(widget|downloader|scrapper|harvester) [NC,OR])

This line has some real issues.
1) WHY the trailing parenthesis after [NC,OR]?

2) Your RewriteCond essentially reads as follows:
The user agent BEGINS with the word webpage, followed immediately (no intervening space) by ANY of the words enclosed in parentheses.
2a) EX: one such UA would read (at the beginning) webpagewidget, which is MORE THAN unlikely given what you believe you're attempting to catch.

2b) No idea what you're attempting to catch with that line.
Perhaps you could expand on what exactly you're attempting to do, maybe even provide an existing UA that fits this criterion?

Don

KenB
msg:4047270
10:47 pm on Dec 21, 2009 (gmt 0)

Unless the policies at WebmasterWorld have changed in the last few moments ;)

Participants are only allowed to edit submissions within a very short time frame, and not at all once another participant has submitted a reply.

I was more thinking of a new thread where we start off with a short .htaccess list (which I'm working on) and people then submit entries they ACTUALLY found in their server logs recently. We could periodically collect the entries into an updated master list posted as a new message in that thread.

I was thinking this would be more suited to a group effort in Google Wave, but I'm sure we could make this work. ;)

This line has some real issues.
1) WHY the trailing parenthesis after [NC,OR]?

Bad typing when too tired. It was a sample line I made up to demonstrate my point; it isn't actual code I use. I originally had the sample code inline with the sentence, inside parentheses, then broke it out into a quote but accidentally left the trailing parenthesis.

2) Your RewriteCond essentially reads as follows:
The user agent BEGINS with the word webpage, followed immediately (no intervening space) by ANY of the words enclosed in parentheses.
2a) EX: one such UA would read (at the beginning) webpagewidget, which is MORE THAN unlikely given what you believe you're attempting to catch.

Yes, I was just fabricating a string off the top of my head to demonstrate what I meant. I didn't have one readily available to copy and paste.

2b) No idea what you're attempting to catch with that line.

Nothing. I was trying to make it obvious that it was fake so no one would actually add it. Apparently my attempt failed. :(

Next time I'll use ^widget.?(blue|red|green) [NC,OR] ;)

wilderness
msg:4047301
11:59 pm on Dec 21, 2009 (gmt 0)

There is something in the Forum Library which Jan (aka bull) put a lot of work into and could be updated.

[webmasterworld.com...]

The problem with these things is a lack of conformity (as in the old Close to Perfect threads), especially when some dweeb comes along and inserts their entire version of an htaccess, regardless of whether the file actually works or not.
The repetition and length of such submissions turn off most regular participants.


jdMorgan
msg:4047312
12:15 am on Dec 22, 2009 (gmt 0)

This method of access control is practically obsolete now, as the worst scrapers and spammers have now moved on, and most use 'real' browser User-Agent strings. Aside from WGET and indy library, there are very few of these UAs left 'in the wild'.

I'm currently using a whitelist of user-agents with full validation on all aspects of their requests, IP-address range blacklists, and several other techniques. These are all the UAs that currently remain on my blacklist:

Babya\ Discoverer
curl
EmailSiphon
^goof
grub-client
heritrix/
IECheck
Indy\ Library
Jakarta\ Commons-HttpClient
larbin
libidn
LWP::(Simple|trivial)
Microsoft\ URL\ Control
Mozilla/[0-9.]+\ (ips-agent|\ Beta\ \(Windows\)|VB\ Project|Site\ Server|Viewzi)
Nutch
^Toata\ dragostea
Plesk
POE-Component-Client-HTTP
Wget
^-$ (literally, a hyphen)

Your best source for UAs to block is your own server UA stats report; there's little use blocking a user-agent that never visits, or doesn't visit often enough to constitute a real problem (however you define that term).
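
For anyone curious, the whitelist side boils down to something like this; it's only a stripped-down sketch with example UA substrings, not my actual ruleset:

# RewriteEngine On is assumed earlier in the file.
# Forbid any request whose UA contains none of the allowed substrings,
# except for robots.txt. The substrings below are illustrative only.
RewriteCond %{HTTP_USER_AGENT} !(Firefox|Chrome|Safari|Opera|MSIE|Googlebot|Slurp|msnbot) [NC]
RewriteRule !^robots\.txt$ - [F]

UAs that pass the whitelist still get validated on IP range and request behavior elsewhere; that's where the real filtering happens now.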

Jim

KenB
msg:4047336
1:33 am on Dec 22, 2009 (gmt 0)

This method of access control is practically obsolete now, as the worst scrapers and spammers have now moved on, and most use 'real' browser User-Agent strings. Aside from WGET and indy library, there are very few of these UAs left 'in the wild'.

This is what I was thinking, and why I wanted to clean out my .htaccess file. I don't expect to stop many of the bad bots via UA any more; what I'm mostly hoping for is to reduce the number of individual users who try to rip my site for later offline use. The bad bots may be annoying, but they are also getting smarter about flying under the radar, which includes not bringing my server to its knees by making too many requests too quickly. It's the individual user who decides to cache my entire site for later use, with speed controls turned off, who causes the real problems.

I'm on a shared hosting environment and my web host hasn't implemented any type of throttling mechanism I can use to limit how many requests an individual user can make over a given period of time.

My best hope seems to be a robust IP block list that targets server farms and a selective UA block list.
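
On the IP side, I'm picturing something along the lines of this mod_access block; the ranges here are just placeholders, not real server farms:

# Allow everyone except the listed ranges (placeholder ranges only).
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24
Deny from 198.51.100.0/24

These directives work in .htaccess on Apache 1.3 and 2.2 provided the host allows AllowOverride Limit.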

jdMorgan
msg:4047363
2:55 am on Dec 22, 2009 (gmt 0)

You've seen this classic thread [webmasterworld.com], I hope...

Jim

KenB
msg:4047365
3:03 am on Dec 22, 2009 (gmt 0)

No, I hadn't seen that thread. I'll chew on it while searching my logs one bot name at a time.

KenB
msg:4048032
1:37 am on Dec 23, 2009 (gmt 0)

Okay, here is my updated .htaccess UA ban list. It is based on the classic list found at [webmasterworld.com...] and the list of default UA strings for programming libraries found at [webmasterworld.com...]

My methodology was to select a representative sample of logs from my website covering 85 days, with a total uncompressed file size of 10.6 GB. I put all of the selected log files into their own folder on my computer and appended ".txt" to the file names so that Windows would search inside them. I then searched the files for each UA string. If hits were returned, I opened a sampling of the matching logs to verify the strings and look for IP ranges I could block. I also added new UA strings not found on the lists above, based on what I was finding in my logs as I went along.

My hope is that others will post some of the UA strings they are blocking that are actively hitting their servers. I would also hope that folks resist the urge to post monolithic lists that haven't had their inactive UA strings cleaned out.

# PREVENT PREFETCHING OF PAGES
#=====================================
RewriteCond %{HTTP:X-moz} ^prefetch [NC,OR]

# BLOCK DEFAULT UA OF PROGRAMMING LIBRARIES
#==================================================
RewriteCond %{HTTP_USER_AGENT} ^curl/ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^HTMLParser [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Jakarta\ Commons [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Java [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^libcurl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^libwww-perl [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^LWP::Simple [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^lwp-request [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ Data\ Access [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL\ Control [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^MS\ Web\ Services\ Client\ Protocol [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PECL::HTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^POE-Component-Client-HTTP [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^PycURL [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Python-urllib [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Snoopy [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^VB\ Project [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WWW::Mechanize [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Indy\ Library [NC,OR]
RewriteCond %{HTTP_USER_AGENT} RPT-HTTPClient [NC,OR]

# BLOCK BAD BOTS, ETC. - VERIFIED IN LOGS 2009-12
#==================================================
RewriteCond %{HTTP_USER_AGENT} ^$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^AISearchBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^al_viewer [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^amibot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^BDFetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^core-project [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Cuam\ Ver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^DoubleVerify\ Crawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Express\ WebPictures [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/(2|3)\.0 [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/([1-9])\.([0-9])\ http [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/([1-9])\.([0-9])$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Offline\ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Ruby [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^SBL-BOT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Space\ Bison [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Squid-Prefetch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Twisted\ PageGetter [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebCopier [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebImages [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^YebolBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Zend_Http_Client [NC,OR]
RewriteCond %{HTTP_USER_AGENT} 80legs [NC,OR]
RewriteCond %{HTTP_USER_AGENT} aiHitBot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Arachmo [NC,OR]
RewriteCond %{HTTP_USER_AGENT} asynchttp [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Downloader [NC,OR]
RewriteCond %{HTTP_USER_AGENT} DreamPassport [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Email [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Exabot-Thumbnails [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Extractor [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Fetch\ API\ Request [NC,OR]
RewriteCond %{HTTP_USER_AGENT} HTTrack [NC,OR]
RewriteCond %{HTTP_USER_AGENT} larbin [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MS\ FrontPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MSFrontPage [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MSIECrawler [NC,OR]
RewriteCond %{HTTP_USER_AGENT} MyDiGiRabi [NC,OR]
RewriteCond %{HTTP_USER_AGENT} NEWT [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Nutch [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ppclabs_bot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SimulBrowse [NC,OR]
RewriteCond %{HTTP_USER_AGENT} SpiderMan [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Spinn3r [NC,OR]
RewriteCond %{HTTP_USER_AGENT} Web\ Spider [NC,OR]
RewriteCond %{HTTP_USER_AGENT} WebCapture [NC,OR]
RewriteCond %{HTTP_USER_AGENT} webcollage [NC,OR]

# BLOCK BAD BOTS - USED BUT NOT FOUND IN LOGS SAMPLED 2009-12
#==================================================
RewriteCond %{HTTP_USER_AGENT} ^DittoSpyder [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Download [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport\ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^WebZIP [NC]

# BLOCK BAD BOTS - EXECUTE RULE
#==================================================
RewriteRule !^(robots\.txt|feed\.xml)$ - [F,L]

Note that the final instruction is intended to allow the blocked bots to both access the robots.txt file AND my RSS feed, but nothing else.

keyplyr
msg:4050989
8:15 pm on Dec 29, 2009 (gmt 0)

Thanks KenB for the heads-up. For the last 8 or 9 years I've been adding to my .htaccess and it was getting huge (20 KB). Six months ago, with all the chatter about site speed from Yahoo and Google, I started to condense my IP block code by switching from mod_rewrite to mod_access and started several UA whitelists. But I still had too much fluff.

Seeing this thread was a reminder that I need to continue to optimize. It's hard to let go of things that once gave that feeling of protection, but alas many of these bad boys just do not look like this any longer.

I've got the file size of my .htaccess down to 9 KB now, and better yet the number of processes being run has been reduced by 60%. I see a noticeable benefit in page load time.
