
Forum Moderators: bakedjake


Can server block bots?

to control unwanted crawling

     
9:43 pm on Feb 1, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 9, 2001
posts:1307
votes: 0


I admittedly know nothing at all about the nuts & bolts of server technology, so please forgive the simplicity of this question:

Can a website hosting server tell the difference between a visitor going through individual pages on a website and a bot doing the same thing?

If the answer is "yes", then can a server be configured to reject (or redirect) ALL bot crawling that is not approved by the site owner?

I do realize, of course, that robots.txt is supposed to perform that function, but it's also my understanding that some malicious bots will simply ignore robots.txt. That being the case, I'm wondering if there is a way for the Linux/Unix server itself to block (via .htaccess, for example) any and all bot crawling that is not explicitly allowed?
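For reference, the robots.txt convention described above looks roughly like this; "Goodbot" is only a placeholder for whichever crawlers are actually welcome, and, as noted, a badly behaved bot is free to ignore the file entirely:

User-agent: Goodbot
Disallow:

User-agent: *
Disallow: /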

Thanks for any feedback...

......................................

10:35 pm on Feb 1, 2007 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 4, 2006
posts: 77
votes: 0


Look at this thread

[webmasterworld.com...]

4:02 am on Feb 2, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Mar 31, 2003
posts:1316
votes: 0


As a preface to the link above: there's no way to detect all bots, because bots can behave exactly like a real Web browser. But most bots can be detected and blocked by checking the User Agent.
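As a rough illustration of that user-agent approach (assuming Apache with mod_rewrite enabled; "badbot" and "evilscraper" below are placeholder substrings, not real bot names):

RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} badbot [NC,OR]
RewriteCond %{HTTP_USER_AGENT} evilscraper [NC]
RewriteRule .* - [F,L]

The [F] flag answers any matching request with 403 Forbidden instead of serving the page, and [NC] makes the match case-insensitive.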
4:20 am on Feb 2, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Dec 9, 2001
posts:1307
votes: 0


That was a very informative thread -- my eyes are now fuzzy and my brain numb! Given that I do not have any measurable level of expertise with server modification, I am taking to heart the admonition to not mess with the "RewriteCond %{HTTP_USER_AGENT}" htaccess code.

Luckily, I just now found that my cPanel at the sites being heavily crawled has a feature called "IP Deny Manager", so I am carefully scrutinizing the CGI script I use to capture the IP addresses of all the crawlers and blocking those that are not identified (or that are identified but appear useless!). Hopefully that will save me a ton of bandwidth and keep those rodents at bay for a while.
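For anyone finding this later: cPanel's IP Deny Manager typically just writes Apache access rules into .htaccess for you, something along these lines (the addresses below are placeholders from the documentation ranges, not real crawlers):

order allow,deny
deny from 192.0.2.15
deny from 198.51.100.0/24
allow from all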

Thanks for the feedback...

........................................

[edited by: Reno at 4:22 am (utc) on Feb. 2, 2007]

5:45 pm on Feb 2, 2007 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Aug 11, 2004
posts:1014
votes: 0


Try searching for "ipchains" or "iptables". These are Linux firewall tools that can block a single IP address, or a whole range of addresses, from reaching your server at all.
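A rough sketch of the iptables version (run as root; the addresses are placeholders, and the rules won't survive a reboot unless you save them with your distribution's usual mechanism):

# drop everything coming from a single address
iptables -A INPUT -s 192.0.2.15 -j DROP

# drop an entire range, written as a CIDR block
iptables -A INPUT -s 198.51.100.0/24 -j DROP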

I use them to deny the "Nicebot" access.

Matt