
writing a function for identifying spiders

what would be a good algorithm?

     
6:22 am on Nov 28, 2002 (gmt 0)

Preferred Member

10+ Year Member

joined:Aug 11, 2002
posts:388
votes: 0


I am about to write a function (in PHP) which identifies visitors to my site as either spiders or humans.
What would be a good algorithm for that?

The reason I need this is that if it's a spider I will not start a session; if it's a human user I need to track their session.

And the reason for this is that I don't want session variables in my URLs indexed by spiders.

The rudimentary function that I have written is simply this:
function is_bot($HTTP_USER_AGENT)
{
    // true if the string 'bot' appears anywhere in the user agent
    return substr_count($HTTP_USER_AGENT, 'bot') > 0;
}

All this does is check whether the string 'bot' appears in the user agent.

This should take care of Google at least... and some others.

Any suggestions? This is a little bit of cloaking, so I am posting here :)

<Edit>Realized after posting that there is a forum dedicated to spiders, so moderators please feel free to move this there if it belongs there. The word cloaking was all in my head when I was thinking of this, so I didn't think twice before posting it here. :)</Edit>

10:14 am on Nov 28, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 25, 2002
posts:872
votes: 0


FWIW I like the following; it's pretty effective and rarely turns up false positives (I use it as a check along the lines of "MaybeBot()" for an anti-spam script).

The following are all regular expressions:

"bot"
"robot"
"spider"
"crawler"
"agent"
"validator"

"perl/\d\.\d+"
"python"
"php/\d\.\d"
"java\d\.\d.\d"
"curl/\d\.\d\.\d"
"lwp-request/\d\.\d+"

"^[a-z][a-z]/\d\.\d+$"
"^[a-z]+$"

"@"
"\sat\s"
"http://"

Group 1 has an obvious origin in that they are all "crawler"-derived words.

Group 2 covers programming tools which are commonly seen (I could add more to this, but these are the main non-malicious ones I see).

Group 3 covers the text-only and/or basic two-letter + version user-agents.

Finally, group 4 is the "got URL/email" filter, which scans for either a URL or an email identifier in the user-agent.

You might also want to consider checking the "From" HTTP header, since any half-decent engine or 3rd-party crawler seems to ensure it is set.
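If it helps, here's a rough, untested PHP sketch of how those pattern groups and the From-header check could be glued together. The function name, the condensed pattern list and the delimiters are just illustrative, not my actual script:

[pre]
<?php
// Illustrative only -- a trimmed-down version of the pattern groups above.
function maybe_bot($ua, $from_header = '')
{
    $patterns = array(
        // group 1: crawler-derived words
        '/bot|robot|spider|crawler|agent|validator/i',
        // group 2: commonly seen programming tools
        '#perl/\d\.\d+|python|php/\d\.\d|curl/\d\.\d\.\d|lwp-request/\d\.\d+#i',
        // group 3: bare "xx/1.0" or single-word user-agents
        '#^[a-z][a-z]/\d\.\d+$|^[a-z]+$#i',
        // group 4: URL or email address embedded in the user-agent
        '#@|\sat\s|http://#i'
    );

    foreach ($patterns as $p) {
        if (preg_match($p, $ua)) {
            return true;
        }
    }

    // any half-decent crawler seems to set the From header; browsers don't
    return $from_header != '';
}
?>
[/pre]

You would call it with something like maybe_bot($HTTP_USER_AGENT, isset($_SERVER['HTTP_FROM']) ? $_SERVER['HTTP_FROM'] : '').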

- Tony

3:11 pm on Nov 28, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Sept 21, 1999
posts:2141
votes: 0


Human visitors don't request a robots.txt file, so when looking for spiders I sort my logs by identifying an IP that requests it, then extract all subsequent visits from that IP. Something similar may serve as an effective "pre-filter": if a visitor asks for robots.txt, check it against the spider list; otherwise it's a human visiting.
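In PHP terms the pre-filter might look something like this (untested sketch; the flat-file name is made up, and robots.txt would need to be routed through a script for the first function ever to run):

[pre]
<?php
// Rough sketch of the "requested robots.txt" pre-filter idea.
$spider_ip_file = '/tmp/robots_txt_ips.txt'; // example path only

// Called from whatever serves robots.txt: remember the requesting IP.
function remember_spider_ip($file)
{
    $ip = $_SERVER['REMOTE_ADDR'];
    $known = file_exists($file) ? file($file) : array();
    if (!in_array($ip . "\n", $known)) {
        $fp = fopen($file, 'a');
        fwrite($fp, $ip . "\n");
        fclose($fp);
    }
}

// Called from normal pages: has this IP ever asked for robots.txt?
// If so, treat it as a probable spider and check it against your list
// before deciding whether to start a session.
function has_requested_robots_txt($file)
{
    return file_exists($file)
        && in_array($_SERVER['REMOTE_ADDR'] . "\n", file($file));
}
?>
[/pre]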
11:04 am on Nov 29, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Apr 25, 2002
posts:872
votes: 0


Ah yes, but what about those robots which don't request robots.txt?

Also, what about those users who are inquisitive / annoying enough to request robots.txt?

Assuming you get a few AOLers doing that, thanks to the "miracle" (or travesty / abomination) that is their load-balancing proxies, you might find yourself dropping a lot of valid traffic.

- Tony

11:13 am on Nov 29, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 26, 2000
posts:2176
votes: 0


The other problem with using robots.txt is that even good bots that ask for it don't ask for it each time they visit.
Google can hit your site dozens of individual times in a single day, but only request the robots.txt file on the first visit.
10:18 pm on Dec 2, 2002 (gmt 0)

Full Member

10+ Year Member

joined:May 15, 2002
posts:236
votes: 0


I found the best way is to store the IPs in a database.
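For example (sketch only; the table and column names are made up, the connection details are placeholders, and the IP list itself still has to be built from your logs):

[pre]
<?php
// Assumes a table spider_ips(ip VARCHAR(15)) already filled with known spider IPs.
mysql_connect('localhost', 'dbuser', 'dbpass'); // placeholder credentials
mysql_select_db('mydb');                        // placeholder database name

function ip_is_known_spider($ip)
{
    $result = mysql_query(
        "SELECT 1 FROM spider_ips WHERE ip = '" . mysql_escape_string($ip) . "'"
    );
    return $result && mysql_num_rows($result) > 0;
}

if (!ip_is_known_spider($_SERVER['REMOTE_ADDR'])) {
    session_start(); // only humans get a session
}
?>
[/pre]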
10:24 pm on Dec 2, 2002 (gmt 0)

Senior Member

WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Feb 4, 2002
posts:5044
votes: 0


I wrote a very simplistic one for just this thing. Just put something from the UA string in the array and you've got it...

[pre]
<?php
/* Use this to start a session only if the UA is *not* a search engine,
   to avoid duplicate-content issues with URL propagation of SIDs */

$searchengines = array("Google", "Fast", "Slurp", "Ink", "ia_archiver", "Scooter");

$is_search_engine = 0;
foreach ($searchengines as $val) {
    // strstr() is case-sensitive, so the array entries must match the UA's casing
    if (strstr($HTTP_USER_AGENT, $val)) {
        $is_search_engine++;
    }
}

if ($is_search_engine == 0) { // Not a search engine

    /* You can put anything in here that needs to be
       hidden from search engines */
    session_start();

} else { // Is a search engine

    /* Put anything you want only for search engines in here */
    $foo = $bar; // placeholder

}

?>
[/pre]

Hope that helps...

Nick