

Search Engine Spider and User Agent Identification Forum

    
Exclusion of bots from dynamic pages
A flexible alternative to robots.txt
Everyman
6:18 pm on Apr 13, 2001 (gmt 0)


I run a site that has hundreds of thousands of pages that are generated dynamically. Each of these pages contains links to anywhere from a few to a few hundred of the other pages. Six months ago I let Google in, after discovering that their algorithms are the only ones that work well on my pages. About every five weeks, Google spends 10 days crawling with about 5 crawlers, 24/7, and then gets tired and comes back next month.

I lifted the robots.txt exclusion on my cgi-bin directory in order to let Google in. Then I added a bunch of other bots to my own exclusion file, so that when they try to come in, they get a "Server too busy" message. I know I could have tried to finesse the robots.txt by adding "Disallows" for each of these other bots, but I didn't think this would be reliable.
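For anyone who wants to try the same trick, something along these lines would do it. This is only a rough sketch, assuming a Python CGI script; the bot names and the wording are placeholders, not the actual exclusion list:

    #!/usr/bin/env python
    # Rough sketch of an "own exclusion file" check in a CGI script.
    # The bot names and the response text are illustrative placeholders.
    import os, sys

    DENIED = ("slurp", "scooter", "lycos")   # illustrative bot names

    agent = os.environ.get("HTTP_USER_AGENT", "").lower()
    if any(name in agent for name in DENIED):
        # Brush the bot off with a "Server too busy" response.
        sys.stdout.write("Status: 503 Service Unavailable\r\n")
        sys.stdout.write("Content-Type: text/plain\r\n\r\n")
        sys.stdout.write("Server too busy. Please try again later.\n")
        sys.exit(0)

    # ... otherwise fall through and build the dynamic page as usual ...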

Yesterday I discovered an alternative method to dynamically decide what's a bot and what isn't. I've been writing the HTTP_FROM string to my own cgi-bin log files for several years now, so I'm in a position to know who uses it and who doesn't. After studying the last few months of logs, I've decided that most of the major bots use this field, and apart from bots, only the rare misconfigured browser ever uses it. Originally it was intended for an email address.

In other words, this environment variable is a good way for a cgi program to make a fast, first-level, up-front determination about whether a request is coming from a bot. Of course, it won't help you with those personal spiders, and you need additional levels of monitoring to keep your bandwidth safe from them, but as a first-level filter it seems to be a pretty good trick.
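In code, the first-level filter can be as small as this. A minimal sketch, again assuming a Python CGI script, with Googlebot standing in as an example of a welcome bot:

    #!/usr/bin/env python
    # Minimal sketch of the HTTP_FROM test as a Python CGI script.
    # Treating Googlebot as the one welcome bot is just an example.
    import os, sys

    from_header = os.environ.get("HTTP_FROM", "")
    agent = os.environ.get("HTTP_USER_AGENT", "").lower()

    # A From: header almost always means a robot, since it was meant to
    # carry the operator's email address and ordinary browsers don't send it.
    if from_header and "googlebot" not in agent:
        sys.stdout.write("Status: 503 Service Unavailable\r\n")
        sys.stdout.write("Content-Type: text/plain\r\n\r\n")
        sys.stdout.write("Server too busy.\n")
        sys.exit(0)

    # ... anything that gets this far is treated as a human visitor ...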

The best second-level monitoring I've come up with is to look at the tail of your log file and count how many hits match each line up to a depth just past the minutes digit of the timestamp. That prefix includes the domain and the time down to the minute, but not the seconds. If any one host exceeds a certain number of hits per minute, it has to be a bot, because no one can read your pages that fast!
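Here's a sketch of that counting in Python. The log path, the size of the tail, and the per-minute limit are made-up figures, and it assumes Common Log Format timestamps such as [13/Apr/2001:18:18:02 +0000]:

    #!/usr/bin/env python
    # Sketch of the per-minute rate check over the tail of an access log.
    # Path, tail size, and threshold are assumptions, not real settings.
    import collections

    LOG = "/var/log/apache/access_log"   # hypothetical path
    LIMIT = 30                           # hits per minute before we cry "bot"

    counts = collections.Counter()
    with open(LOG) as fh:
        for line in fh.readlines()[-2000:]:      # only the tail of the log
            bracket = line.find("[")
            if bracket == -1:
                continue
            # Keep host + date + hh:mm, dropping the seconds and everything after.
            key = line[:bracket + 18]
            counts[key] += 1

    for key, hits in counts.items():
        if hits > LIMIT:
            print("probable bot: %s (%d hits in one minute)" % (key, hits))

Run something like that from cron every few minutes and feed whatever it flags into the exclusion list above.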

 

Brett_Tabke
5:15 pm on Apr 16, 2001 (gmt 0)

Nice catch, Everyman. Yes, if you study and compare the headers that spiders send, there are a couple of glaring differences. HTTP_FROM isn't one I rely on entirely, though, because some spiders don't send it.
