The reason I need this is that if it's a spider I will not start a session. If it's a human user I need to track his session.
And the reason for this is that I don't want session variables in my URLs indexed by spiders.
The rudimentary function that I have written is simply this:
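function is_bot($user_agent)
{
    // True when 'bot' appears anywhere in the user agent (case-insensitive).
    // The earlier version returned true when the count was 0, which inverted the result.
    return stripos($user_agent, 'bot') !== false;
}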
All this does is check whether the string 'bot' appears in the user agent.
This should take care of Google at least, and some others.
Any suggestions? This is a little bit of cloaking, so I am posting here :)
<Edit>Realized after posting that there is a forum dedicated to spiders, so moderators please feel free to move it there if it belongs there. The word cloaking was all in my head when I was thinking of this, so I didn't think twice before posting it here. :)</Edit>
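To see what a plain 'bot' substring test does and does not catch, here is a minimal sketch. The helper name ua_mentions_bot and the sample user-agent strings are assumptions chosen for illustration; they are not taken from the post.

```php
<?php
// Hypothetical helper: the same substring idea, matched case-insensitively.
function ua_mentions_bot($user_agent)
{
    // stripos() returns false when 'bot' is not found anywhere in the string.
    return stripos($user_agent, 'bot') !== false;
}

// Sample user-agent strings, included only for illustration.
$samples = array(
    'Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)',
    'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)',
    'Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.1)',
);

foreach ($samples as $ua) {
    echo (ua_mentions_bot($ua) ? 'bot:   ' : 'human: ') . $ua . "\n";
}
```

The Slurp string is reported as human because it never contains 'bot', so a single substring is likely to miss some crawlers.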
The following are all regular expressions:

"bot"
"robot"
"spider"
"crawler"
"agent"
"validator"

"perl/\d\.\d+"
"python"
"php/\d\.\d"
"java\d\.\d\.\d"
"curl/\d\.\d\.\d"
"lwp-request/\d\.\d+"

"^[a-z][a-z]/\d\.\d+$"
"^[a-z]+$"

"@"
"\sat\s"
"http://"
Group 1 should have an obvious origin: they are all "crawler"-derived words.
Group 2 are programming tools which are commonly seen (could add more to this but these are the main non-malicious ones I see).
Group 3 covers the text-only and/or basic two-letter + version user-agents.
Finally, group 4 contains the "got URL/email" filter, which scans for either a URL or an email identifier in the user-agent.
You might also want to consider checking the "From" HTTP header, since any half-decent engine or 3rd-party crawler seems to ensure it is set.
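As a rough sketch of how the pattern list and the "From" check could be combined in PHP: the function names ua_looks_automated and request_looks_automated are hypothetical, the patterns are adapted from the list in this post, and case-insensitive matching is an assumption rather than something stated here.

```php
<?php
// Hypothetical helper: true when the user agent matches any listed pattern.
function ua_looks_automated($user_agent)
{
    $patterns = array(
        // Group 1: crawler-derived words
        'bot', 'robot', 'spider', 'crawler', 'agent', 'validator',
        // Group 2: programming tools
        'perl/\d\.\d+', 'python', 'php/\d\.\d', 'java\d\.\d\.\d',
        'curl/\d\.\d\.\d', 'lwp-request/\d\.\d+',
        // Group 3: text-only and two-letter + version user agents
        '^[a-z][a-z]/\d\.\d+$', '^[a-z]+$',
        // Group 4: email address or URL in the user agent
        '@', '\sat\s', 'http://',
    );
    foreach ($patterns as $pattern) {
        // '#' is the delimiter because several patterns contain '/'.
        if (preg_match('#' . $pattern . '#i', $user_agent) === 1) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: also treats a non-empty From header as a crawler hint.
// $server is expected to look like $_SERVER.
function request_looks_automated(array $server)
{
    $ua   = isset($server['HTTP_USER_AGENT']) ? $server['HTTP_USER_AGENT'] : '';
    $from = isset($server['HTTP_FROM']) ? trim($server['HTTP_FROM']) : '';
    return $from !== '' || ua_looks_automated($ua);
}
```

A call such as request_looks_automated($_SERVER) could then decide whether to skip session_start(). Note that an empty user agent matches none of these patterns, so it would need its own rule if you want to treat it as automated.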
- Tony
Also, what about those users who are inquisitive / annoying enough to request robots.txt?
If even a few AOL'ers do that, then thanks to the "miracle" (or travesty / abomination) that is their load-balancing proxies, you might find yourself dropping a lot of valid traffic.
- Tony
[pre]
<?php
/* Use this to start a session only if the UA is *not* a search engine,
to avoid duplicate content issues with URL propagation of SIDs */
$searchengines = array("Google", "Fast", "Slurp", "Ink", "ia_archiver", "Scooter");
// Read the user agent from $_SERVER; fall back to an empty string if it is missing
$user_agent = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
$is_search_engine = 0;
foreach ($searchengines as $val) {
    // Case-sensitive substring match against each engine name
    if (strpos($user_agent, $val) !== false) {
        $is_search_engine++;
    }
}
if ($is_search_engine == 0) { // Not a search engine
    /* You can put anything in here that needs to be
    hidden from searchengines */
    session_start();
} else { // Is a search engine
    /* Put anything you want only for searchengines in here */
    $foo = $bar; // placeholder
}
?>
[/pre]
Hope that helps...
Nick