Cookie based spider catcher

   
5:00 pm on May 11, 2012 (gmt 0)

10+ Year Member



Since some newer spiders accept cookies, I've put in some basic code to set a unique identifying cookie. During my testing phase, if the IP changes for that cookie on any page request, I'm sent an email.
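
A minimal sketch of the check described above, assuming the cookie-to-IP pairs live in a plain array (a real site would more likely use a session store or database). The names `checkVisitorIp` and `newVisitorId` are hypothetical and not from the original post; sending the cookie and the email is left to the caller.

```php
<?php
// Hypothetical helper: remembers the last IP seen for each cookie ID and
// reports when a request with the same cookie arrives from a different IP.
function checkVisitorIp(array &$seen, string $cookieId, string $ip): bool
{
    if (!isset($seen[$cookieId])) {
        $seen[$cookieId] = $ip;   // first request for this cookie
        return false;
    }
    if ($seen[$cookieId] === $ip) {
        return false;             // same IP as before
    }
    $seen[$cookieId] = $ip;       // remember the new IP for the next comparison
    return true;                  // IP changed: caller may log or notify
}

// Hypothetical helper: a random 32-character hex ID for a new visitor.
// Passing it to setcookie() is left to the caller.
function newVisitorId(): string
{
    return bin2hex(random_bytes(16));
}
```

Keeping the comparison separate from the notification makes it easy to swap the email for a log entry later.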

What I'm finding is that I'm getting notifications for a lot of legitimate users. Certain wireless networks, military proxies, etc. change IPs during a user's session.

How do others deal with this? DNS lookups? Whitelisting?
3:55 pm on May 12, 2012 (gmt 0)

wilderness (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month)



motor,
Since the response has been so overwhelming ;)

I don't use cookies; however, I frown upon users changing and/or using multiple IPs.

In many instances, the second IP is from a network that is caching my data, disregarding the pages' meta tags that tell it not to do so.
The end result is that I'll deny future access to the second IP and possibly even the primary IP.

There is one network (its name escapes me) that uses as many as four IPs in addition to the user's primary IP.
6:49 pm on May 12, 2012 (gmt 0)

dstiles (WebmasterWorld Senior Member, Top Contributor of All Time, 5+ Year Member)



I first check for other issues - bad UA, bad headers, server-farm etc. If bad I then block any change of IP as defined by the accepted session cookie (this is from IIS). I do not set cookies specifically.

Apart from that, an exception to blocking IP changes is required for (at least) AOL, which issues each user with a variety of IPs (is this still true?).
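
One possible reading of this approach, sketched under assumptions: an IP change within a session is only treated as hostile when the earlier checks already look bad, and known IP-rotating providers are exempt. The function name and its three boolean inputs are hypothetical stand-ins for checks the post does not spell out.

```php
<?php
// Hypothetical decision function, not the poster's actual code.
// $otherSignalsBad stands for the earlier checks (bad UA, bad headers, server farm).
// $exemptNetwork stands for a provider known to rotate IPs, such as AOL.
function shouldBlockIpChange(bool $ipChanged, bool $otherSignalsBad, bool $exemptNetwork): bool
{
    if (!$ipChanged) {
        return false;          // nothing to judge
    }
    if ($exemptNetwork) {
        return false;          // rotating IPs is normal for this provider
    }
    return $otherSignalsBad;   // an IP change alone is not enough to block
}
```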

I would not send out an email for an IP change. They are far too frequent. It can sometimes run to dozens within a few minutes, almost always from botnet-using baddies. Instead I run a series of "security" logs which record different types of activity. I view these every once in a while through RDP.
11:08 pm on May 12, 2012 (gmt 0)

10+ Year Member



I put in both time checks and duplicate checks for the emails I get.
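
A rough sketch of what such time and duplicate checks might look like, assuming an in-memory array of last-sent timestamps. The function name, the key format and the one-hour default are assumptions, not details from the post.

```php
<?php
// Hypothetical throttle: decides whether a notification for $key should be sent.
// $lastSent maps a key (for example "cookieId|newIp") to the Unix time of the last email.
function shouldNotify(array &$lastSent, string $key, int $now, int $minInterval = 3600): bool
{
    if (isset($lastSent[$key]) && ($now - $lastSent[$key]) < $minInterval) {
        return false;          // duplicate inside the quiet period
    }
    $lastSent[$key] = $now;    // record this send
    return true;
}
```

Keying on the cookie and the new IP together suppresses repeats of the same event while still letting a genuinely new IP through.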

So far:

AOL
Verizon wireless
Sprint wireless
T-mobile wireless
Hughes and Wild Blue satellite
US Navy, DHS, etc. users and their major contractors (Boeing for instance)
Some of the Fortune 500 companies

All of these change IPs during a session for the same user (their cache fetches tend to use a different UA).

That's just off the top of my head; I've found several more "legit" ranges that can change IPs, which is why I asked. Whitelisting seems like a darn near impossible task: at the rate I'm going I'll have 100+ ranges by the end of the week, with no end in sight!
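
For anyone who does go the whitelist route, here is a minimal sketch of matching an IPv4 address against a list of CIDR ranges. It assumes 64-bit PHP; the function names are hypothetical and the ranges are placeholder documentation blocks, not real ISP ranges.

```php
<?php
// Returns true when the IPv4 address $ip falls inside the CIDR block $cidr.
function ipInCidr(string $ip, string $cidr): bool
{
    $parts = explode('/', $cidr, 2);
    if (count($parts) !== 2 || !ctype_digit($parts[1])) {
        return false;                       // not in "a.b.c.d/n" form
    }
    $bits    = (int) $parts[1];
    $ipLong  = ip2long($ip);
    $netLong = ip2long($parts[0]);
    if ($ipLong === false || $netLong === false || $bits > 32) {
        return false;
    }
    // Mask with the top $bits bits set; /0 matches every address.
    $mask = $bits === 0 ? 0 : (~0 << (32 - $bits)) & 0xFFFFFFFF;
    return ($ipLong & $mask) === ($netLong & $mask);
}

// Returns true when $ip matches any range in $ranges.
function isWhitelisted(string $ip, array $ranges): bool
{
    foreach ($ranges as $cidr) {
        if (ipInCidr($ip, $cidr)) {
            return true;
        }
    }
    return false;
}

// Placeholder ranges (documentation address blocks), assumed for illustration only.
$legitRanges = ['192.0.2.0/24', '198.51.100.0/24'];
```

The list itself still has to be maintained by hand, which is the hard part described above.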

Dstiles, I like your method. It lets the legits through and only catches the bots which try to sneak in under a different "persona."
11:10 pm on May 12, 2012 (gmt 0)

lucy24 (WebmasterWorld Senior Member, Top Contributor of All Time, Top Contributors Of The Month)



AOL, which issues each user with a variety of IPs (is this still true?)

That was my impression too but they must not like me because I couldn't find any recent ones. But I had one only yesterday that annoyed me because the second IP was in a different b range, meaning I had to hand-check it. ###. Like this:

html 165.138.0.nn
all subsidiary pages 165.139.0.nn OR ..nn+1

To save everyone else the trip: The IP apparently belongs to the Indiana Department of Education. And the pattern definitely looks human.
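
The hand check for "how far apart are these two IPs" can be automated. The sketch below counts the leading bits two IPv4 addresses share; the function name is hypothetical, and the final octets in the usage note are made up because the post elides them as nn.

```php
<?php
// Number of leading bits two IPv4 addresses have in common (0 to 32),
// or -1 if either string is not a valid IPv4 address.
function sharedPrefixBits(string $a, string $b): int
{
    $la = ip2long($a);
    $lb = ip2long($b);
    if ($la === false || $lb === false) {
        return -1;
    }
    $diff = ($la ^ $lb) & 0xFFFFFFFF;   // bits that differ
    $bits = 32;
    while ($diff > 0) {                 // each shift drops one differing-or-lower bit
        $diff >>= 1;
        $bits--;
    }
    return $bits;
}
```

With made-up final octets, `sharedPrefixBits('165.138.0.5', '165.139.0.6')` returns 15: the two addresses sit in different /16 blocks but in the same /15.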
11:16 pm on May 12, 2012 (gmt 0)

g1smd (WebmasterWorld Senior Member, Top Contributor of All Time, 10+ Year Member, Top Contributors Of The Month)



I just added some custom logging; it seems appropriate to share the code:
<?php

# Error Logging 2012-05-11

$oldSetting = ignore_user_abort( TRUE ); // otherwise an aborted request can screw up the logfile

// $_SERVER is used directly: variable-variable access such as ${'_SERVER'}
// does not trigger PHP's just-in-time creation of the superglobal, and
// $HTTP_SERVER_VARS was removed in PHP 5.4.

//$requestHost = $_SERVER[ 'HTTP_HOST' ];
$requestHost = $_SERVER[ 'SERVER_NAME' ];

// Pick the log directory from the host name.
if( stristr( $requestHost, 'example.co.uk' )) {
    if( stristr( $requestHost, 'dev' )) {
        define( '_DIRECTORY', '/var/www/vhosts/example.co.uk/dev/httpdocs/includes/logging/' );
    } elseif( stristr( $requestHost, 'www' )) {
        define( '_DIRECTORY', '/var/www/vhosts/example.co.uk/www/httpdocs/includes/logging/' );
    }
}

if( !defined( '_DIRECTORY' )) { // unknown host: nowhere to log
    ignore_user_abort( $oldSetting );
    exit();
}

define( '_LOGFILE', 'errorlogfile.txt' );
define( '_LOGMAXLINES', 1000 );

$logFile = _DIRECTORY . _LOGFILE;

$datetime = date( 'Y-m-d H:i:s O' );

$remoteIP = $_SERVER[ 'REMOTE_ADDR' ];

$requestURI = $_SERVER[ 'REQUEST_URI' ];

$userAgent = isset( $_SERVER[ 'HTTP_USER_AGENT' ])
    ? $_SERVER[ 'HTTP_USER_AGENT' ]
    : '<unknown user agent>';

$referer = isset( $_SERVER[ 'HTTP_REFERER' ])
    ? $_SERVER[ 'HTTP_REFERER' ]
    : '<unknown referer>';

$logLine = $datetime . " - " . $remoteIP . " - " . $requestHost . " - " . $requestURI . " - " . $userAgent . " - " . $referer . "\n";

// Read the existing log; start with an empty list if the file is missing or unreadable.
$log = is_file( $logFile ) ? file( $logFile ) : array(); // flock() disabled in some kernels (eg 2.4)
if( $log === FALSE ) $log = array();

if( $fp = fopen( $logFile, 'a' )) { // tiny danger of 2 threads interfering; live with it
    if( count( $log ) >= _LOGMAXLINES ) { // otherwise grows like Topsy
        fclose( $fp ); // fopen,fclose put close together as possible
        while( count( $log ) >= _LOGMAXLINES ) array_shift( $log );
        array_push( $log, $logLine );
        $logLine = implode( '', $log );
        $fp = fopen( $logFile, 'w' ); // rewrite the trimmed log in one go
    }
    if( $fp ) {
        fputs( $fp, $logLine );
        fclose( $fp );
    }
}

ignore_user_abort( $oldSetting ); // restore the setting before exit() ends the script
exit();

?>
 
