
Linux, Unix, and *nix like Operating Systems Forum

Can a server block bots?
to control unwanted crawling
Reno
msg:3240031
9:43 pm on Feb 1, 2007 (gmt 0)

I admittedly know nothing at all about the nuts & bolts of server technology, so please forgive the simplicity of this question:

Can a website hosting server tell the difference between a visitor going through individual pages on a website and a bot doing the same thing?

If the answer is "yes", then can a server be configured to reject (or redirect) ALL bot crawling that is not approved by the site owner?

I do realize of course that robots.txt is supposed to perform that function, but it's also my understanding that some malicious bots will simply ignore robots.txt. That being the case, I'm wondering if there is a way for the Linux/Unix server itself to block (via .htaccess, for example) any and all bot crawling that is not explicitly allowed?

Thanks for any feedback...

......................................

 

boxfan
msg:3240099
10:35 pm on Feb 1, 2007 (gmt 0)

Look at this thread

[webmasterworld.com...]

mcavic
msg:3240320
4:02 am on Feb 2, 2007 (gmt 0)

As a preface to the link above: there's no way to detect all bots, because a bot can behave exactly like a real web browser. But most bots can be detected and blocked by checking the User-Agent header.
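A minimal .htaccess sketch of that approach (the bot names here are placeholders, not a vetted list; substitute whatever actually shows up in your logs):

    RewriteEngine On
    # Match suspect user agents, case-insensitively (example names only)
    RewriteCond %{HTTP_USER_AGENT} (EvilScraper|BadBot|SiteSucker) [NC]
    # Return 403 Forbidden to anything that matches
    RewriteRule .* - [F,L]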

Reno
msg:3240326
4:20 am on Feb 2, 2007 (gmt 0)

That was a very informative thread -- my eyes are now fuzzy and my brain numb! Given that I do not have any measurable level of expertise with server modification, I am taking to heart the admonition not to mess with the "RewriteCond %{HTTP_USER_AGENT}" .htaccess code.

Luckily I just now found that the cPanel at the sites being heavily crawled has a feature called "IP Deny Manager", so I am carefully scrutinizing the CGI script I use to capture the IP addresses of all the crawlers, and am blocking those that are not identified (or that are identified but appear useless!). Hopefully that will save me a ton of bandwidth and keep those rodents at bay for a while.
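For the record, a tool like IP Deny Manager appears to just write standard Apache deny rules into .htaccess, so the hand-written equivalent would look roughly like this (the addresses are made-up examples):

    # Refuse requests from a single address and from a whole range (example values)
    Order Allow,Deny
    Allow from all
    Deny from 192.0.2.15
    Deny from 198.51.100.0/24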

Thanks for the feedback...

........................................

[edited by: Reno at 4:22 am (utc) on Feb. 2, 2007]

Matt Probert
msg:3240922
5:45 pm on Feb 2, 2007 (gmt 0)

Try searching for "ipchains" or "iptables". These are kernel firewall tools that can block a single IP address, or a whole range of addresses, from reaching your server at all.

I use them to deny "Nicebot" access.
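For example, something along these lines (the addresses are placeholders; run as root, and check your distribution's documentation before making rules permanent):

    # Drop all traffic from one address (example address)
    iptables -A INPUT -s 192.0.2.15 -j DROP
    # Drop all traffic from a whole /24 range (example range)
    iptables -A INPUT -s 198.51.100.0/24 -j DROP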

Matt
