
Perl Server Side CGI Scripting Forum

    
Banning bots
Dexie
5+ Year Member
Msg#: 4255039 posted 8:05 am on Jan 19, 2011 (gmt 0)

Some excellent info here. This list below seems to be the latest - any others to add? Especially for email grabbers and scrapers please?
# Deny direct requests for the .htaccess file itself
<Files .htaccess>
deny from all
</Files>

RewriteEngine on
RewriteBase /

# Block known e-mail harvesters, scrapers and site downloaders by User-Agent
RewriteCond %{HTTP_USER_AGENT} ^EmailSiphon [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailWolf [OR]
RewriteCond %{HTTP_USER_AGENT} ^ExtractorPro [OR]
RewriteCond %{HTTP_USER_AGENT} ^Mozilla.*NEWT [OR]
RewriteCond %{HTTP_USER_AGENT} ^Crescent [OR]
RewriteCond %{HTTP_USER_AGENT} ^CherryPicker [OR]
RewriteCond %{HTTP_USER_AGENT} ^[Ww]eb[Bb]andit [OR]
RewriteCond %{HTTP_USER_AGENT} ^WebEMailExtrac [OR]
RewriteCond %{HTTP_USER_AGENT} ^NICErsPRO [OR]
RewriteCond %{HTTP_USER_AGENT} ^Teleport [OR]
RewriteCond %{HTTP_USER_AGENT} ^Zeus.*Webster [OR]
RewriteCond %{HTTP_USER_AGENT} ^Microsoft\.URL [OR]
RewriteCond %{HTTP_USER_AGENT} ^Wget [OR]
RewriteCond %{HTTP_USER_AGENT} ^LinkWalker [OR]
RewriteCond %{HTTP_USER_AGENT} ^sitecheck\.internetseer\.com [OR]
RewriteCond %{HTTP_USER_AGENT} ^ia_archiver [OR]
RewriteCond %{HTTP_USER_AGENT} ^DIIbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^psbot [OR]
RewriteCond %{HTTP_USER_AGENT} ^EmailCollector
RewriteRule .* - [F]

# Return 403 to anything referred from www.iaea.org.
# Note: a RewriteRule pattern matches only the URL path, never the
# Referer, so the referrer test has to live in a RewriteCond.
RewriteCond %{HTTP_REFERER} ^http://www\.iaea\.org [NC]
RewriteRule .* - [F]

 

caribguy
WebmasterWorld Senior Member 5+ Year Member
Msg#: 4255039 posted 8:32 am on Jan 19, 2011 (gmt 0)

Not so sure that this topic fits in the Perl forum...

You should take a look in the Search Engine Spider and User Agent Identification [webmasterworld.com] forum. One of the approaches mentioned there is whitelisting, which boils down to allowing only the agents you trust and banning anything that does not pass the smell test.

In addition to banning certain 'suspicious' user-agent strings, you might want to look at the header information that is supplied, and also at the IP range the supposed visitor is accessing your site from; a rough sketch of those extra checks follows below. YMMV ;)
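For instance, in .htaccess terms (hypothetical example only - the header tests and the 192.0.2.x range are placeholders, not recommendations):

# Deny requests that claim to be a browser but arrive without the
# headers real browsers normally send (illustrative thresholds)
RewriteCond %{HTTP_USER_AGENT} ^Mozilla
RewriteCond %{HTTP_ACCEPT} ^$ [OR]
RewriteCond %{HTTP:Accept-Language} ^$
RewriteRule .* - [F]

# Deny a specific IP range you have seen misbehaving
# (192.0.2.0/24 is a documentation range, used here as a placeholder)
RewriteCond %{REMOTE_ADDR} ^192\.0\.2\.
RewriteRule .* - [F]

The Accept-Language test in particular is aggressive and can catch some legitimate clients, so treat this as a starting point rather than a drop-in rule.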

Dexie
5+ Year Member
Msg#: 4255039 posted 10:28 pm on Jan 19, 2011 (gmt 0)

Sorry, yep, I'd better post this in another forum. Do you use your .htaccess file for any of this?

janharders
WebmasterWorld Senior Member 5+ Year Member
Msg#: 4255039 posted 10:39 pm on Jan 19, 2011 (gmt 0)

The problem with those lists is that they only block bots that are both bad and stupid (and ia_archiver isn't bad, if that's archive.org's bot); they don't stop the ones that really want the information. Blocking harvesters is pretty much the same as blocking spam bots, so you might want to look at "Bad Behavior", a client-fingerprinting-based solution that tries to identify bots posing as regular browsers and denies them access.
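As a very rough illustration of that fingerprinting idea (this is not how Bad Behavior itself works, just a simplified .htaccess sketch): a request whose User-Agent claims to be MSIE but which arrives with no Accept header at all is unlikely to come from a real copy of Internet Explorer.

# Simplified fingerprint check (illustration only): real MSIE clients
# send an Accept header, so "MSIE" plus an empty Accept is suspicious
RewriteCond %{HTTP_USER_AGENT} MSIE [NC]
RewriteCond %{HTTP_ACCEPT} ^$
RewriteRule .* - [F]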

phranque
WebmasterWorld Administrator, WebmasterWorld Top Contributor of All Time, 10+ Year Member, Top Contributors of the Month
Msg#: 4255039 posted 7:24 am on Jan 20, 2011 (gmt 0)

Whitelisting just the good bots - Search Engine Spider and User Agent Identification forum:
http://www.webmasterworld.com/search_engine_spiders/4255036.htm [webmasterworld.com]
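
In .htaccess terms a whitelist flips the logic of the list above: instead of enumerating bad agents, everything is forbidden unless it matches the agents you explicitly allow. A minimal and deliberately incomplete sketch, assuming you only want ordinary browsers plus a couple of major crawlers (adjust the names to taste):

# Whitelist sketch (incomplete, for illustration): forbid any request
# whose User-Agent matches neither a normal browser nor a listed bot
RewriteCond %{HTTP_USER_AGENT} !^Mozilla
RewriteCond %{HTTP_USER_AGENT} !(Googlebot|bingbot|Slurp) [NC]
RewriteRule .* - [F]

Anything that simply spoofs a browser User-Agent still gets through, which is where the header and fingerprint checks mentioned earlier come in.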
