

Exclusion of bots from dynamic pages

A flexible alternative to robots.txt

   
6:18 pm on Apr 13, 2001 (gmt 0)

Everyman

I run a site that has hundreds of thousands of pages that are generated dynamically. Each of these pages contains links to anywhere from a few to a few hundred of the other pages. Six months ago I let Google in, after discovering that their algorithms are the only ones that work well on my pages. About every five weeks, Google spends 10 days crawling with about 5 crawlers, 24/7, and then gets tired and comes back next month.

I lifted the robots.txt exclusion on my cgi-bin directory in order to let Google in. Then I added a bunch of other bots to my own exclusion file, so that when they try to come in, they get a "Server too busy" message. I know I could have tried to finesse the robots.txt by adding "Disallows" for each of these other bots, but I didn't think this would be reliable.
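For anyone who wants to try the same thing, here's a rough sketch of the idea in Python CGI terms. This isn't my actual script; the blocklist file name, the bot substring, and the status text are just made-up examples of the pattern:

    #!/usr/bin/env python3
    # Rough sketch: turn away blocklisted user agents with a 503
    # before doing any expensive page generation.
    # "blocked_bots.txt" and its contents are made-up examples.
    import os
    import sys

    def is_blocked(user_agent):
        # One blocklisted substring per line, e.g. "EmailSiphon".
        try:
            with open("blocked_bots.txt") as f:
                patterns = [p.strip().lower() for p in f if p.strip()]
        except OSError:
            return False
        ua = user_agent.lower()
        return any(p in ua for p in patterns)

    if is_blocked(os.environ.get("HTTP_USER_AGENT", "")):
        # Under plain CGI, the Status: line tells the server
        # which response code to send.
        sys.stdout.write("Status: 503 Server Too Busy\r\n")
        sys.stdout.write("Content-Type: text/plain\r\n\r\n")
        sys.stdout.write("Server too busy. Please try again later.\n")
        sys.exit(0)

    # ...otherwise fall through and generate the page as usual...

Refusing up front means the expensive page generation never runs for a blocklisted agent, and a well-behaved bot should treat a 503 as "come back later" rather than "page gone."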

Yesterday I discovered an alternative method to dynamically decide what's a bot and what isn't. I've been writing the HTTP_FROM string to my own cgi-bin log files for several years now, so I'm in a position to know who uses it and who doesn't. After studying the last few months of logs, I've concluded that most of the major bots send this field, and apart from bots, only the rare misconfigured browser ever does. (The From header behind it was originally intended to carry the user's email address.)

In other words, this environment variable is a good way for a cgi program to make a fast, first-level, up-front determination about whether a request is coming from a bot. Of course, it won't help you with those personal spiders, and you need additional levels of monitoring to keep your bandwidth safe from them, but as a first-level filter it seems to be a pretty good trick.
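In CGI terms the first-level check is almost a one-liner. A minimal sketch, assuming Python CGI (the log file name is hypothetical):

    import os

    # First-level filter: the HTTP From: header shows up in a CGI
    # environment as HTTP_FROM. Real browsers almost never send it;
    # most of the major bots do.
    http_from = os.environ.get("HTTP_FROM")

    if http_from:
        # Probably a bot: note it and treat the request accordingly.
        # "bot_hits.log" is a hypothetical file name.
        with open("bot_hits.log", "a") as log:
            log.write("FROM=%s AGENT=%s\n"
                      % (http_from, os.environ.get("HTTP_USER_AGENT", "-")))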

The best second-level monitoring I've come up with is to look at the tail of your log file and count how many hits share the same prefix, comparing each line up through the minute of the timestamp. That prefix takes in the requesting host and the time, but not the seconds. If any host exceeds a certain number of hits per minute, it has to be a bot, because no one can read your pages that fast!
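Here's a rough sketch of that second-level check. The log format is assumed to be the common Apache style (host first, then a timestamp like [13/Apr/2001:18:06:42 ...]), and the threshold is just a number to tune:

    from collections import Counter

    MAX_HITS_PER_MINUTE = 30  # assumed threshold; tune for your site

    def fast_clients(log_lines):
        # Group hits by (host, timestamp-up-to-the-minute) and flag
        # any host that exceeds the per-minute threshold.
        counts = Counter()
        for line in log_lines:
            host = line.split(" ", 1)[0]
            start = line.find("[")
            if start == -1:
                continue
            # "[13/Apr/2001:18:06" -- 18 characters reach just past
            # the minute and stop before the seconds.
            counts[(host, line[start:start + 18])] += 1
        return sorted({host for (host, _), n in counts.items()
                       if n > MAX_HITS_PER_MINUTE})

    # Example: scan the last 1000 lines of the access log.
    with open("access_log") as f:
        tail = f.readlines()[-1000:]
    for host in fast_clients(tail):
        print("probable bot:", host)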

5:15 pm on Apr 16, 2001 (gmt 0)

brett_tabke (WebmasterWorld Administrator)



Nice catch, Everyman. Yes, if you study and compare the headers that spiders send, there are a couple of glaring differences. HTTP_FROM isn't one I rely on entirely, though, because there are some spiders that don't send it.
 
