
Forum Moderators: coopster & jatar k & phranque


Banning bots

     

Dexie

8:05 am on Jan 19, 2011 (gmt 0)

10+ Year Member



Some excellent info here. The list below seems to be the latest - any others to add? Especially any for email grabbers and scrapers, please?
# Keep the .htaccess file itself from being served
<Files .htaccess>
deny from all
</Files>

RewriteEngine on
RewriteBase /

# Known e-mail harvesters, scrapers and offline downloaders, matched on user-agent
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac.* [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck\.internetseer\.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
# Return 403 Forbidden to anything matching one of the conditions above
RewriteRule .* - [F]

# Refuse requests arriving with this known spam referer
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org$ [NC]
RewriteRule .* - [F]

caribguy

8:32 am on Jan 19, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



Not so sure that this topic fits in the Perl forum...

You should take a look in the Search Engine Spider and User Agent Identification [webmasterworld.com] forum. One of the approaches mentioned there is whitelisting, which boils down to allowing only the visitors you recognise as legitimate and banning anything that does not pass the smell test.
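
For illustration only, a minimal mod_rewrite sketch of that idea - the spider names below are just the obvious big three of the day and Mozilla-prefixed agents are simply assumed to be browsers, so treat it as a starting point rather than a vetted whitelist:

# Whitelist sketch: if the user-agent is neither browser-like nor one of the
# named search engine spiders, send a 403. Patterns are illustrative only.
RewriteCond %{HTTP_USER_AGENT} !^Mozilla [NC]
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|Slurp|msnbot) [NC]
RewriteRule .* - [F]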

In addition to banning certain 'suspicious' user-agent strings, you might want to take a look at the header information that is supplied, and also the IP range that a supposed visitor is accessing your site from. YMMV ;)
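
As a rough example of the header and IP-range angle (the address range below is only a placeholder for one you would have traced yourself):

# Real browsers virtually always send an Accept header; forbid requests that don't
RewriteCond %{HTTP_ACCEPT} ^$
RewriteRule .* - [F]

# Deny a specific range identified as a scraper host (placeholder range)
Order Allow,Deny
Allow from all
Deny from 192.0.2.0/24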

Dexie

10:28 pm on Jan 19, 2011 (gmt 0)

10+ Year Member



Sorry, yep, I'd better post this in another forum. Do you use your .htaccess file for any of this?

janharders

10:39 pm on Jan 19, 2011 (gmt 0)

WebmasterWorld Senior Member 5+ Year Member



The problem with those lists is that they only block bots that are both bad and stupid (and ia_archiver isn't even bad, if that's archive.org's bot); they don't block the ones that really want the information. Blocking harvesters is pretty much the same as blocking spam bots, so you might want to look at "Bad Behavior" - it's a client-fingerprinting-based solution that tries to identify bots posing as regular browsers and denies them access.
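
To give a flavour of the fingerprinting idea in .htaccess terms (a crude sketch of the general principle, not Bad Behavior's actual rules):

# A client claiming to be MSIE normally sends an Accept header; one that
# claims MSIE but sends none is very likely a script wearing a browser UA.
RewriteCond %{HTTP_USER_AGENT} MSIE [NC]
RewriteCond %{HTTP_ACCEPT} ^$
RewriteRule .* - [F]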

phranque

7:24 am on Jan 20, 2011 (gmt 0)

WebmasterWorld Administrator phranque is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month



Whitelisting just the good bots - Search Engine Spider and User Agent Identification forum:
http://www.webmasterworld.com/search_engine_spiders/4255036.htm [webmasterworld.com]
 
