Forum Library, Charter, Moderators: Ocean10000 & incrediBILL

Search Engine Spider and User Agent Identification Forum

Cookie based spider catcher

 5:00 pm on May 11, 2012 (gmt 0)

Since some newer spiders accept cookies, I've put in some basic code to set a unique identifying cookie. During my testing phase, for each page request, if the IP changes for that cookie I'm sent an email.

What I'm finding is that I'm getting notifications about a lot of legitimate users: certain wireless networks, military proxies, etc. change IPs during a user's session.

How do others deal with this? DNS lookups? Whitelisting?
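For anyone wondering what the cookie side of this looks like, here is a minimal sketch. Everything in it (the function names, the sha1-based token, the array-backed store) is my own illustration of the idea described above, not the poster's actual code:

```php
<?php
// Sketch of a cookie-based spider catcher: hand out a unique
// identifying cookie, then flag any request where the IP no longer
// matches the one recorded for that cookie value.

function make_visitor_id( $secret, $time, $rand ) {
	// Unique, hard-to-guess value for the identifying cookie
	return sha1( $secret . '|' . $time . '|' . $rand );
}

function ip_changed( $cookieValue, $remoteIP, &$seen ) {
	// $seen maps cookie value => last IP seen; returns TRUE when the
	// IP for an already-known cookie changes (the "email me" case)
	if( !isset( $seen[ $cookieValue ])) {
		$seen[ $cookieValue ] = $remoteIP;
		return FALSE;
	}
	if( $seen[ $cookieValue ] !== $remoteIP ) {
		$seen[ $cookieValue ] = $remoteIP;
		return TRUE;
	}
	return FALSE;
}
```

In a live setup the cookie would go out via setcookie() and $seen would live in a database or file rather than in memory.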



 3:55 pm on May 12, 2012 (gmt 0)

Since the response has been so overwhelming ;)

I don't use cookies; however, I frown upon users changing and/or using multiple IPs.

In many instances, the second IP is from a network that is caching my data, disregarding the page's meta tags telling it not to do so.
The end result is that I'll deny future access to the second IP and possibly even the primary IP.

There is one network (the name escapes me) that utilizes as many as four IPs in addition to the user's primary IP.


 6:49 pm on May 12, 2012 (gmt 0)

I first check for other issues: bad UA, bad headers, server farm, etc. If bad, I then block any change of IP as tracked by the accepted session cookie (this comes from IIS; I do not set cookies specifically).

Apart from that, an exception to blocking IP changes is required for (at least) AOL, which issues each user with a variety of IPs (is this still true?).

I would not send out an email for an IP change. They are far too frequent. It can sometimes run to dozens within a few minutes, almost always from botnet-using baddies. Instead I run a series of "security" logs which record different types of activity. I view these every once in a while through RDP.
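The layered approach above boils down to: an IP change alone is tolerated, but an IP change combined with another bad sign gets blocked. A sketch of that decision in my own wording, not the poster's code:

```php
<?php
// Sketch of the layered check described above: the IP change only
// matters once some other test (bad UA, bad headers, server farm)
// has already flagged the visitor. Illustrative only.

function should_block( $badUA, $badHeaders, $serverFarm, $ipChangedInSession ) {
	$alreadySuspicious = $badUA || $badHeaders || $serverFarm;
	// Legit users on rotating-IP networks pass: no other bad sign
	return $alreadySuspicious && $ipChangedInSession;
}
```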


 11:08 pm on May 12, 2012 (gmt 0)

I put in both time checks and duplicate checks for the emails I get.

So far:

Verizon wireless
Sprint wireless
T-mobile wireless
Hughes and Wild Blue satellite
US Navy, DHS, etc. users and their major contractors (Boeing for instance)
Some of the Fortune 500 companies

All these change IPs during sessions, with the same user (their cache fetches tend to use a different UA).

This is just off the top of my head; there are several more "legit" ranges I've found that can change IPs, which is why I asked. Whitelisting seems like it would be a darn near impossible task, because at the rate I'm going I'll have 100+ ranges by the end of the week with no end in sight!

Dstiles, I like your method. It lets the legits through and only catches the bots which try to sneak in under a different "persona."
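The "time checks and duplicate checks" on the notification emails can be as simple as a per-key throttle. A hypothetical sketch (the names and the one-hour window are my own assumptions):

```php
<?php
// Sketch of time/duplicate checks before sending a notification email:
// the same key (e.g. cookie + IP pair) only alerts once per
// $minInterval seconds. Illustrative names, not the poster's code.

function should_notify( $key, $now, &$lastSent, $minInterval ) {
	if( isset( $lastSent[ $key ]) && ( $now - $lastSent[ $key ]) < $minInterval ) {
		return FALSE;	// duplicate within the window; suppress
	}
	$lastSent[ $key ] = $now;	// record and let this one through
	return TRUE;
}
```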


 11:10 pm on May 12, 2012 (gmt 0)

AOL, which issues each user with a variety of IPs (is this still true?)

That was my impression too but they must not like me because I couldn't find any recent ones. But I had one only yesterday that annoyed me because the second IP was in a different b range, meaning I had to hand-check it. ###. Like this:

html 165.138.0.nn
all subsidiary pages 165.139.0.nn OR ..nn+1

To save everyone else the trip: The IP apparently belongs to the Indiana Department of Education. And the pattern definitely looks human.
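The hand-check was needed precisely because the two IPs fall outside the same b range (first two octets). A quick helper for that comparison, hypothetical and IPv4-only:

```php
<?php
// TRUE when two IPv4 addresses share their first two octets
// (the "b range" mentioned above). Hypothetical helper.

function same_b_range( $ipA, $ipB ) {
	$a = explode( '.', $ipA );
	$b = explode( '.', $ipB );
	return ( $a[0] === $b[0] ) && ( $a[1] === $b[1] );
}
```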


 11:16 pm on May 12, 2012 (gmt 0)

I just added some custom logging; seems appropriate to share the code:

# Error Logging 2012-05-11

$oldSetting = ignore_user_abort( TRUE );// otherwise can screw-up logfile

if( !empty( $GLOBALS[ '_SERVER' ])) {
	$_SERVER_ARRAY = '_SERVER';
} elseif( !empty( $GLOBALS[ 'HTTP_SERVER_VARS' ])) {
	$_SERVER_ARRAY = 'HTTP_SERVER_VARS';
} else {
	$_SERVER_ARRAY = 'GLOBALS';
}
global ${$_SERVER_ARRAY};

//$requestHost = ${$_SERVER_ARRAY}[ 'HTTP_HOST' ];
$requestHost = ${$_SERVER_ARRAY}[ 'SERVER_NAME' ];

if( stristr( $requestHost, 'example.co.uk' )) {
	if( stristr( $requestHost, 'dev' )) {
		define( '_DIRECTORY', '/var/www/vhosts/example.co.uk/dev/httpdocs/includes/logging/' );
	} else if( stristr( $requestHost, 'www' )) {
		define( '_DIRECTORY', '/var/www/vhosts/example.co.uk/www/httpdocs/includes/logging/' );
	}
}

define( '_LOGFILE', 'errorlogfile.txt' );
define( '_LOGMAXLINES', 1000 );

$logFile = _DIRECTORY . _LOGFILE;

$datetime = date( 'Y-m-d H:i:s O' );

$remoteIP = ${$_SERVER_ARRAY}[ 'REMOTE_ADDR' ];

$requestURI = ${$_SERVER_ARRAY}[ 'REQUEST_URI' ];

$userAgent = ( isset( ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]))
	? ${$_SERVER_ARRAY}[ 'HTTP_USER_AGENT' ]
	: '<unknown user agent>';

$referer = ( isset( ${$_SERVER_ARRAY}[ 'HTTP_REFERER' ]))
	? ${$_SERVER_ARRAY}[ 'HTTP_REFERER' ]
	: '<unknown referer>';

$logLine = $datetime . " - " . $remoteIP . " - " . $requestHost . " - " . $requestURI . " - " . $userAgent . " - " . $referer . "\n";

$log = file( $logFile );// flock() disabled in some kernels (eg 2.4)

if( $fp = fopen( $logFile, 'a' )) {// tiny danger of 2 threads interfering; live with it
	if( count( $log ) >= _LOGMAXLINES ) {// otherwise grows like Topsy
		fclose( $fp );// fopen,fclose put as close together as possible
		while( count( $log ) >= _LOGMAXLINES ) array_shift( $log );
		array_push( $log, $logLine );
		$logLine = implode( '', $log );
		$fp = fopen( $logFile, 'w' );
	}
	fputs( $fp, $logLine );
	fclose( $fp );
}

ignore_user_abort( $oldSetting );



All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved