
Apache Web Server Forum

    
Blocking bots
Some uniform way of doing this?
designaweb




msg:4053584
 8:32 am on Jan 4, 2010 (gmt 0)

I had the lightspeed bot crawling parts of my site that I don't want crawled. Most likely, other bots are doing this too.

Is there some uniform way to block unwanted bots? Does anyone have a "close-to-complete" list of which IPs to block?

 

jdMorgan




msg:4053729
 2:30 pm on Jan 4, 2010 (gmt 0)

Any such IP address-range list would be A) huge, and B) out of date by next week.
Use your server logs and stats to identify actual problems and address those first.

The first thing to do is to block this particular 'bot. Then investigate how it is that it thought it was allowed to spider parts of your site that you don't want spidered -- Is there a problem or a logical inconsistency in your use of robots.txt and on-page meta-robots tags, for example? Or does it simply not obey your valid robots.txt and on-page robots-control directives?
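
One way to look for a logical inconsistency in robots.txt is to feed the rules to a parser and ask what a given user-agent is permitted to fetch. The following is only a minimal sketch using Python's standard urllib.robotparser module; the rules, the paths, and the user-agent name "ExampleBot" are invented for illustration and should be replaced with your own.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt content; substitute the file your server actually serves.
ROBOTS_TXT = """\
User-agent: *
Disallow: /private/
Disallow: /cgi-bin/
"""


def allowed_paths(robots_txt, user_agent, paths):
    """Return {path: bool} showing what robots.txt permits for user_agent."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return {path: parser.can_fetch(user_agent, path) for path in paths}


if __name__ == "__main__":
    # "ExampleBot" and these paths are placeholders, not real names from any site.
    report = allowed_paths(
        ROBOTS_TXT,
        "ExampleBot",
        ["/", "/private/report.html", "/cgi-bin/form.pl"],
    )
    for path, ok in report.items():
        print(path, "allowed" if ok else "disallowed")
```

If a path you want kept private shows up as allowed here, the problem is in your rules rather than in the 'bot's behavior.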

For legitimate but unwanted 'bots that always identify themselves, robots.txt and User-Agent-string-based access controls are the 'least expensive' in terms of list size and server performance impact. Next would be large IP address-range blocks (but be careful, as the correlation of ranges to countries, regions, or companies is often very poor), and finally, smaller specific IP address ranges for specific robots.
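
The ordering of the server-side checks described above can be sketched roughly as follows: test the User-Agent string first, then the IP address ranges. This is an illustration only, written in Python rather than as server configuration. The user-agent substrings are made up, and the address ranges come from the blocks reserved for documentation, so none of these values identify a real robot.

```python
import ipaddress

# Hypothetical User-Agent substrings to refuse (lower case).
BLOCKED_UA_SUBSTRINGS = ("examplebot", "badcrawler")

# Hypothetical ranges, taken from the reserved documentation address blocks.
BLOCKED_NETWORKS = [
    ipaddress.ip_network("198.51.100.0/24"),  # stands in for a large range
    ipaddress.ip_network("203.0.113.16/28"),  # stands in for one robot's small range
]


def is_blocked(user_agent, remote_addr):
    """Return the reason a request is refused, or None if it passes."""
    ua = (user_agent or "").lower()
    # Cheapest test first: a few substring comparisons on the User-Agent.
    for needle in BLOCKED_UA_SUBSTRINGS:
        if needle in ua:
            return "user-agent matches " + needle
    # Then the IP address-range checks.
    addr = ipaddress.ip_address(remote_addr)
    for net in BLOCKED_NETWORKS:
        if addr in net:
            return "address in " + str(net)
    return None
```

For example, is_blocked("Mozilla/5.0", "203.0.113.20") reports the /28 range, while the same user-agent from 192.0.2.1 passes.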

Another avenue to pursue is that of whitelisting. Rather than try to deny everything that's 'bad,' consider allowing only those accesses that you deem to be 'good'. Careful analysis of incoming requests, looking at *all* of the HTTP headers sent with each request (many of which do not show up in standard server logs) can be quite telling. For example, some Googlebot requests are fake, and don't come from Google at all. But this isn't obvious until you check some of the additional HTTP headers not usually logged by servers.
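
A whitelist of this kind might look something like the sketch below, which denies by default and lets the first matching User-Agent rule decide. Everything in it is an assumption made for illustration: the robot name "ExampleSearchBot", the address range (a reserved documentation block), and the sets of required headers are placeholders. The headers that actually distinguish real requests from fake ones have to come from your own analysis.

```python
import ipaddress

# Hypothetical whitelist, most specific rule first. Each entry names a
# User-Agent substring, the networks that agent is expected to come from
# (None means any address), and the headers it is expected to send.
WHITELIST = [
    {
        "ua_contains": "examplesearchbot",
        "networks": [ipaddress.ip_network("192.0.2.0/24")],
        "required_headers": {"accept", "from"},
    },
    {
        "ua_contains": "mozilla",  # ordinary browsers
        "networks": None,
        "required_headers": {"accept", "accept-language"},
    },
]


def is_allowed(headers, remote_addr):
    """Allow a request only if it satisfies the first rule its User-Agent matches."""
    lowered = {name.lower(): value for name, value in headers.items()}
    ua = lowered.get("user-agent", "").lower()
    addr = ipaddress.ip_address(remote_addr)
    for entry in WHITELIST:
        if entry["ua_contains"] not in ua:
            continue
        networks = entry["networks"]
        in_network = networks is None or any(addr in net for net in networks)
        has_headers = entry["required_headers"] <= set(lowered)
        # The first rule whose name matches decides, so a request that claims
        # to be the robot cannot fall through to the looser browser rule.
        return in_network and has_headers
    return False  # default deny: nothing on the whitelist matched
```

With these placeholder rules, a request claiming to be ExampleSearchBot is refused unless it comes from the expected range and carries the expected headers, which is the same idea as catching a fake Googlebot.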

Also take a look at the bad-bot scripts posted here by Key_Master, xlcus, and AlexK over in the PHP and Perl forum libraries. These implement behavior-based access control methods, and provide yet another 'angle' on the access control problem.

The subject of access control, including robot control and the more general user-agent control, is quite wide and very deep. So it's a good idea to address your immediate problem and then spend a year or so developing what you feel is an appropriate access-control policy for your site. After 'paying attention' to this problem for an extended period of time, you'll find that what you want and need to do becomes quite a bit clearer. As a result, you'll waste far less time implementing great solutions to the wrong problems.

I guess this is a very long way to say that there is no one-size-fits-all solution, and it wouldn't be wise to just try to copy-and-paste a solution off "some Web site on the internet."

Jim

WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved