
Perl Server Side CGI Scripting Forum

    
Banning bots
Dexie
msg:4255041 - 8:05 am on Jan 19, 2011 (gmt 0)

Some excellent info here. The list below seems to be the most current - are there any others to add, especially for email grabbers and scrapers?
<Files .htaccess>
deny from all
</Files>
RewriteEngine on
RewriteBase /
# Deny known e-mail harvesters, offline downloaders and other unwanted
# bots by user-agent prefix:
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\ URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck\.internetseer\.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule .* - [F]
# Deny any request referred from a specific site (dots escaped so they
# match literally). Note: a RewriteRule pattern matches the URL path,
# never a full http:// URL, so the referer test belongs in a RewriteCond:
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org [NC]
RewriteRule .* - [F]

 

caribguy
msg:4255049 - 8:32 am on Jan 19, 2011 (gmt 0)

Not so sure that this topic fits in the Perl forum...

You should take a look in the Search Engine Spider and User Agent Identification [webmasterworld.com] forum. One of the approaches that is mentioned there is whitelisting, which boils down to banning anything that does not pass the smell test.

In addition to banning certain 'suspicious' user-agent strings, you might want to take a look at the header information a visitor supplies, and also at the IP range a supposed visitor is accessing your site from. YMMV ;)
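A minimal sketch of that whitelisting idea in .htaccess, assuming Apache with mod_rewrite; the user-agent patterns here are illustrative placeholders, not a vetted list:

```apache
RewriteEngine on
# Whitelist approach: forbid anything that neither looks like a normal
# browser (almost all send a Mozilla- or Opera-based user-agent) nor
# matches a known-good crawler. Everything else falls through untouched.
RewriteCond %{HTTP_USER_AGENT} !(Mozilla|Opera) [NC]
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Slurp) [NC]
RewriteRule .* - [F]
```

The inversion is what distinguishes this from the blacklist above: unknown bots are denied by default instead of allowed by default.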

Dexie
msg:4255432 - 10:28 pm on Jan 19, 2011 (gmt 0)

Sorry, yep, I'd better post this in another forum. Do you use your .htaccess file for any of this?

janharders
msg:4255447 - 10:39 pm on Jan 19, 2011 (gmt 0)

The problem with those lists is that they only block bots that are both bad and stupid (and ia_archiver isn't bad, if that's archive.org's bot); they don't stop the ones that really want the information. Blocking harvesters is much the same as blocking spam bots. You might want to look at Bad Behavior, a client-fingerprinting-based solution that tries to identify bots posing as regular browsers and denies them access.
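A crude illustration of that fingerprinting principle in .htaccess (this is only a sketch of the idea, not Bad Behavior's actual checks): real browsers that identify as MSIE or Firefox virtually always send an Accept header, while many spam bots faking a browser user-agent omit it.

```apache
RewriteEngine on
# The user-agent claims to be a mainstream browser...
RewriteCond %{HTTP_USER_AGENT} (MSIE|Firefox) [NC]
# ...but the request carries no Accept header at all: deny it.
RewriteCond %{HTTP:Accept} ^$
RewriteRule .* - [F]
```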

phranque
msg:4255582 - 7:24 am on Jan 20, 2011 (gmt 0)

Whitelisting just the good bots - Search Engine Spider and User Agent Identification forum:
http://www.webmasterworld.com/search_engine_spiders/4255036.htm [webmasterworld.com]

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
© Webmaster World 1996-2014 all rights reserved