Updating bot ban list & cleaning out obsolete entries


KenB - 1:21 am on Dec 21, 2009 (gmt 0)


I'm sure most of us are familiar with the classic bad bot ban list for .htaccess that gets copied and pasted wholesale from one web developer forum to the next (e.g.: [webmasterworld.com...] ).

Each iteration of this list gets longer, and nobody ever bothers to remove obsolete entries. Last year incrediBill started a thread for default UAs of programming libraries at [webmasterworld.com...]. It's a shame a similar thread didn't get started for bad bots.

I'm testing my bad bot UA strings against a sample of my server logs, representing 10.6 GB of data over 83 days, to find which strings were actually used. Even though it is a small sample of days, it is still a huge amount of data to search, so it will take considerable time to test every entry in my bad bot list. Once completed I will share my condensed list, but I also hope others will help fill in gaps for the most active bad bot UAs.
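The log check described above can be sketched in Python. This is only a rough illustration, not the script actually used for the test: it assumes the logs are in Apache "combined" format (user agent as the last quoted field), treats each ban entry as a plain case-insensitive substring rather than a full regular expression, and the names BAN_TOKENS, count_token_hits and obsolete_tokens are made up for the example.

```python
import re
import sys
from collections import Counter

# Hypothetical sample of ban-list entries; replace with your own list.
BAN_TOKENS = ["webpagewidget", "webpagedownloader", "metacarta.com"]

# Assumption: Apache "combined" log format, where the user agent is the
# last double-quoted field on each line.
UA_FIELD = re.compile(r'"([^"]*)"\s*$')


def count_token_hits(lines, tokens):
    """Count how many log lines contain each token in the UA field.

    Matching is a case-insensitive substring test, which only
    approximates an unanchored RewriteCond with the [NC] flag.
    """
    hits = Counter({token: 0 for token in tokens})
    needles = [(token, token.lower()) for token in tokens]
    for line in lines:
        match = UA_FIELD.search(line)
        if not match:
            continue  # line has no trailing quoted field; skip it
        user_agent = match.group(1).lower()
        for token, needle in needles:
            if needle in user_agent:
                hits[token] += 1
    return hits


def obsolete_tokens(hits):
    """Return the entries that never matched, sorted alphabetically."""
    return sorted(token for token, count in hits.items() if count == 0)


if __name__ == "__main__":
    totals = Counter({token: 0 for token in BAN_TOKENS})
    # Stream each log file given on the command line, one line at a time.
    for path in sys.argv[1:]:
        with open(path, errors="replace") as log_file:
            totals.update(count_token_hits(log_file, BAN_TOKENS))
    for token, count in totals.most_common():
        print(f"{count:8d}  {token}")
    print("Never seen:", ", ".join(obsolete_tokens(totals)) or "(none)")
```

Run it as `python check_bots.py access.log access.log.1` (file names are placeholders). Entries listed under "Never seen" are candidates for removal, although a string that is absent from 83 days of logs could still turn up later.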

It would also be good to have a discussion about what .htaccess methods are truly the fastest.

Some time back I saw comments on WebmasterWorld promoting the use of regular expressions to reduce the number of .htaccess entries. For instance:
RewriteCond %{HTTP_USER_AGENT} (efp@gmx\.net|hhjhj@yahoo\.com|lerly\.net|mapfeatures\.net|metacarta\.com) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Industry|Internet|IUFW|Lincoln|Missouri|Program).?(Program|Explore|Web|State|College|Shareware) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Mac|Ram|Educate|WEP).?(Finder|Search) [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^(Moz+illa|MSIE).?[0-9]?.?[0-9]?[0-9]?$ [NC,OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla/[0-9]\.[0-9][0-9]?.\(compatible[\)\ ] [NC,OR]

I tried this method to condense my bad bot list and I found it actually increased my response times. I thought that maybe if I started a line with a fixed string before the regular expression it would be more efficient. For example:
RewriteCond %{HTTP_USER_AGENT} ^webpage(widget|downloader|scrapper|harvester) [NC,OR]

My thinking was that Apache would quickly bail on the line and go to the next if the first character of the line didn't match the UA string, but this method still slowed down response times compared to having each bot on its own line.
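For anyone who wants to try the one-bot-per-line layout, the small Python sketch below generates it from a list of UA strings. It is an illustration under assumptions, not the list from this post: the function name build_ban_rules is made up, every entry is anchored with ^ by default, and the closing "RewriteRule ^ - [F]" (return 403 Forbidden) is one common way to end such a block rather than something quoted above. Python's re.escape is used to escape literal characters, which should be acceptable to Apache's regex engine but is worth checking against your own entries.

```python
import re


def build_ban_rules(tokens, anchored=True):
    """Build .htaccess lines with one RewriteCond per UA string.

    Each condition carries [NC,OR] except the last, which carries only
    [NC] so that the RewriteRule that follows applies to the whole group.
    """
    tokens = [token for token in tokens if token]
    if not tokens:
        return []
    prefix = "^" if anchored else ""
    lines = ["RewriteEngine On"]
    for index, token in enumerate(tokens):
        is_last = index == len(tokens) - 1
        flags = "[NC]" if is_last else "[NC,OR]"
        # Escape regex metacharacters (and spaces) so the UA string is
        # matched literally.
        pattern = prefix + re.escape(token)
        lines.append(f"RewriteCond %{{HTTP_USER_AGENT}} {pattern} {flags}")
    # Assumption: deny matching requests with 403 Forbidden.
    lines.append("RewriteRule ^ - [F]")
    return lines


if __name__ == "__main__":
    # Hypothetical entries, split out from the grouped example above.
    bots = ["webpagewidget", "webpagedownloader", "webpagescrapper", "webpageharvester"]
    print("\n".join(build_ban_rules(bots)))
```

Whether this layout is really faster than the grouped regular expressions will depend on the server and the list, so compare response times before and after switching.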

It is a total pain to do, but cleaning out obsolete .htaccess entries can really improve website performance, and Google is making an effort to promote faster-loading webpages with their "Page Speed" tools. As such, it is probably only a matter of time before Google decides to factor page load speed into their SERP calculations. So cleaning up our bad bot countermeasures and finding ways to optimize our .htaccess files is probably a good idea.


Thread source: http://www.webmasterworld.com/search_engine_spiders/4046696.htm
Brought to you by WebmasterWorld: http://www.webmasterworld.com