
Forum Moderators: bakedjake


Can server block bots?

to control unwanted crawling

   
9:43 pm on Feb 1, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I admittedly know nothing at all about the nuts & bolts of server technology, so please forgive the simplicity of this question:

Can a website hosting server tell the difference between a visitor going through individual pages on a website and a bot doing the same thing?

If the answer is "yes", then can a server be configured to reject (or redirect) ALL bot crawling that is not approved by the site owner?

I do realize of course that robots.txt is supposed to perform that function, but it's also my understanding that some malicious bots simply ignore robots.txt. That being the case, I'm wondering if there is a way for the Linux/Unix server itself to block (via .htaccess, for example) any and all bot crawling that is not explicitly allowed?

Thanks for any feedback...

......................................

10:35 pm on Feb 1, 2007 (gmt 0)

5+ Year Member



Look at this thread

[webmasterworld.com...]

4:02 am on Feb 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



As a preface to the link above: there's no way to detect all bots, because a bot can behave exactly like a real Web browser. But most bots can be detected and blocked by checking the User-Agent header they send.
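
A minimal sketch of what User-Agent blocking in .htaccess might look like, assuming Apache with mod_rewrite enabled (the bot names "BadBot" and "EmailSiphon" are illustrative placeholders, not a vetted blocklist):

```apache
# Sketch: return 403 Forbidden to requests whose User-Agent matches
# any of the listed patterns ([NC] makes the match case-insensitive).
RewriteEngine On
RewriteCond %{HTTP_USER_AGENT} (BadBot|EmailSiphon) [NC]
RewriteRule .* - [F,L]
```

Note that this only stops bots that announce themselves honestly; a bot that sends a browser-like User-Agent string sails right past it.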

4:20 am on Feb 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



That was a very informative thread -- my eyes are now fuzzy and my brain numb! Given that I do not have any measurable level of expertise with server modification, I am taking to heart the admonition to not mess with the "RewriteCond %{HTTP_USER_AGENT}" htaccess code.

Luckily I just now found that my cPanel at the sites that are being heavily crawled has a feature called "IP Deny Manager", so I am carefully scrutinizing the CGI script I use to capture the IP addresses of all the crawlers, and am blocking those that are not identified (or that are identified but appear useless!). Hopefully that will save me a ton of bandwidth and will keep those rodents at bay for a while.
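
For anyone without cPanel: the IP Deny Manager typically just writes Apache access rules into .htaccess, so a hand-written equivalent is possible. A minimal sketch (the addresses below are placeholders from the reserved documentation ranges, not real crawler IPs):

```apache
# Sketch: deny specific addresses or ranges while allowing everyone else.
Order Allow,Deny
Allow from all
Deny from 192.0.2.10
Deny from 203.0.113.0/24
```

Each denied address receives a 403 Forbidden response, which saves the bandwidth of serving full pages but still costs a little server time per request.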

Thanks for the feedback...

........................................

[edited by: Reno at 4:22 am (utc) on Feb. 2, 2007]

5:45 pm on Feb 2, 2007 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Try searching for "ipchains" or "iptables". These are firewalls that block one or a range of IP addresses from accessing your server.

I use them to deny the "Nicebot" access.
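
A hedged sketch of the iptables approach, for anyone with root access on their server (the addresses are placeholders from the reserved documentation ranges; these commands are not reversible until you delete the rules or reboot, so test carefully):

```shell
# Sketch only: requires root. Drop all packets from a single host:
iptables -A INPUT -s 192.0.2.10 -j DROP

# Drop an entire range using CIDR notation:
iptables -A INPUT -s 203.0.113.0/24 -j DROP

# List the current INPUT rules to verify what was added:
iptables -L INPUT -n
```

Unlike an .htaccess deny, a firewall DROP discards the traffic before Apache ever sees it, so the blocked bot consumes essentially no server resources.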

Matt