
writing a function for identifying spiders

what would be a good algorithm?

     
6:22 am on Nov 28, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:Aug 11, 2002
posts:388
votes: 0


I am about to write a function (in PHP) which identifies visitors to my site as either spiders or humans.
What would be a good algorithm for that?

The reason I need this is that if it's a spider I will not start a session; if it's a human user I need to track their session.

And the reason for this is that I don't want session variables in my URLs indexed by spiders.

The rudimentary function that I have written is simply this:
function is_bot($HTTP_USER_AGENT)
{
    // true if the string 'bot' appears anywhere in the user agent
    return substr_count($HTTP_USER_AGENT, 'bot') > 0;
}

All this does is check whether the string 'bot' appears in the user agent.

This should take care of Google at least... and some others.

Any suggestions? This is a little bit of cloaking, so I am posting here :)

<Edit>Realized after posting that there is a forum dedicated to spiders, so moderators please feel free to move this there if it belongs there. The word cloaking was all in my head when I was thinking of this, so I didn't think twice before posting it here. :)</Edit>

10:14 am on Nov 28, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 25, 2002
posts:872
votes: 0


FWIW I like the following; it's pretty effective and rarely turns up false positives (I use it as a check along the lines of "MaybeBot()" for an anti-spam script).

The following are all regular expressions:

"bot"
"robot"
"spider"
"crawler"
"agent"
"validator"

"perl/\d\.\d+"
"python"
"php/\d\.\d"
"java\d\.\d.\d"
"curl/\d\.\d\.\d"
"lwp-request/\d\.\d+"

"^[a-z][a-z]/\d\.\d+$"
"^[a-z]+$"

"@"
"\sat\s"
"http://"

Group 1 has an obvious origin in that they are all "crawler"-derived words.

Group 2 covers programming tools which are commonly seen (I could add more to this, but these are the main non-malicious ones I see).

Group 3 covers the text-only and/or basic two-letter + version user-agents.

Finally, group 4 is the "got URL/email" filter, which scans for either a URL or an email identifier in the user-agent.

You might also want to consider checking the "From" HTTP header, since any half-decent engine or 3rd-party crawler seems to ensure it is set.
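If it helps, here's a rough, untested PHP sketch of how those pattern groups and the From-header check could be glued together. The function name, the condensed pattern list and the delimiters are just illustrative, not my actual script:

[pre]
<?php
// Illustrative only -- a trimmed-down version of the pattern groups above.
function maybe_bot($ua, $from_header = '')
{
    $patterns = array(
        // group 1: crawler-derived words
        '/bot|robot|spider|crawler|agent|validator/i',
        // group 2: commonly seen programming tools
        '#perl/\d\.\d+|python|php/\d\.\d|curl/\d\.\d\.\d|lwp-request/\d\.\d+#i',
        // group 3: bare "xx/1.0" or single-word user-agents
        '#^[a-z][a-z]/\d\.\d+$|^[a-z]+$#i',
        // group 4: URL or email address embedded in the user-agent
        '#@|\sat\s|http://#i'
    );

    foreach ($patterns as $p) {
        if (preg_match($p, $ua)) {
            return true;
        }
    }

    // any half-decent crawler seems to set the From header; browsers don't
    return $from_header != '';
}
?>
[/pre]

You would call it with something like maybe_bot($HTTP_USER_AGENT, isset($_SERVER['HTTP_FROM']) ? $_SERVER['HTTP_FROM'] : '').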

- Tony

3:11 pm on Nov 28, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 21, 1999
posts:2141
votes: 0


Human visitors don't request a robots.txt file, so when looking for spiders I sort my logs by identifying an IP that requests it, then extract all subsequent visits from that IP. Something similar may serve as an effective "pre-filter": if a visitor asks for robots.txt, check it against the spider list; otherwise it's a human visiting.
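In PHP terms the pre-filter might look something like this (untested sketch; the flat-file name is made up, and robots.txt would need to be routed through a script for the first function ever to run):

[pre]
<?php
// Rough sketch of the "requested robots.txt" pre-filter idea.
$spider_ip_file = '/tmp/robots_txt_ips.txt'; // example path only

// Called from whatever serves robots.txt: remember the requesting IP.
function remember_spider_ip($file)
{
    $ip = $_SERVER['REMOTE_ADDR'];
    $known = file_exists($file) ? file($file) : array();
    if (!in_array($ip . "\n", $known)) {
        $fp = fopen($file, 'a');
        fwrite($fp, $ip . "\n");
        fclose($fp);
    }
}

// Called from normal pages: has this IP ever asked for robots.txt?
// If so, treat it as a probable spider and check it against your list
// before deciding whether to start a session.
function has_requested_robots_txt($file)
{
    return file_exists($file)
        && in_array($_SERVER['REMOTE_ADDR'] . "\n", file($file));
}
?>
[/pre]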
11:04 am on Nov 29, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 25, 2002
posts:872
votes: 0


Ah yes, but what about those robots which don't request robots.txt?

Also, what about those users who are inquisitive / annoying enough to request robots.txt?

Assuming you get a few AOLers doing that, thanks to the "miracle" (or travesty / abomination) that is their load-balancing proxies, you might find yourself dropping a lot of valid traffic.

- Tony

11:13 am on Nov 29, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 26, 2000
posts:2176
votes: 0


The other problem with using robots.txt is that even good bots that ask for it don't ask for it each time they visit.
Google can hit your site dozens of individual times in a single day, but only request the robots.txt file on the first visit.
10:18 pm on Dec 2, 2002 (gmt 0)

Full Member

10+ Year Member

joined:May 15, 2002
posts:236
votes: 0


I found the best way is to store the IPs in a database.
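For example (sketch only; the table and column names are made up, the connection details are placeholders, and the IP list itself still has to be built from your logs):

[pre]
<?php
// Assumes a table spider_ips(ip VARCHAR(15)) already filled with known spider IPs.
mysql_connect('localhost', 'dbuser', 'dbpass'); // placeholder credentials
mysql_select_db('mydb');                        // placeholder database name

function ip_is_known_spider($ip)
{
    $result = mysql_query(
        "SELECT 1 FROM spider_ips WHERE ip = '" . mysql_escape_string($ip) . "'"
    );
    return $result && mysql_num_rows($result) > 0;
}

if (!ip_is_known_spider($_SERVER['REMOTE_ADDR'])) {
    session_start(); // only humans get a session
}
?>
[/pre]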
10:24 pm on Dec 2, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 4, 2002
posts:5044
votes: 0


I wrote a very simplistic one for just this thing. Just put something from the UA string in the array and you've got it...

[pre]
<?php
/* Use this to start a session only if the UA is *not* a search engine,
   to avoid duplicate-content issues with URL propagation of SIDs */

$searchengines = array("Google", "Fast", "Slurp", "Ink", "ia_archiver", "Scooter");

$is_search_engine = 0;
foreach ($searchengines as $val) {
    // strstr() is case-sensitive, so the array entries must match the UA's casing
    if (strstr($HTTP_USER_AGENT, $val)) {
        $is_search_engine++;
    }
}

if ($is_search_engine == 0) { // Not a search engine

    /* You can put anything in here that needs to be
       hidden from search engines */
    session_start();

} else { // Is a search engine

    /* Put anything you want only for search engines in here */
    $foo = $bar; // placeholder

}

?>
[/pre]

Hope that helps...

Nick