
Cloaking Forum

writing a function for identifying spiders
what would be a good algo?
jaski
msg:679734
6:22 am on Nov 28, 2002 (gmt 0)

I am about to write a function (in PHP) that identifies visitors to my site as either spider or human.
What would be a good algo for that?

The reason I need this is that if it's a spider I will not start a session, while if it's a human user I need to track the session.

And the reason for this is that I don't want session IDs in my URLs to get indexed by spiders.

The rudimentary function that I have written is simply this:
function is_bot($HTTP_USER_AGENT)
{
    // true when the string 'bot' appears anywhere in the user agent
    return substr_count($HTTP_USER_AGENT, 'bot') > 0;
}

All this does is check whether the string 'bot' appears in the user agent.

This should take care of Google at least (Googlebot ends in 'bot') ... and some others.
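Here is how I plan to use it, as a minimal sketch (assuming the function above and PHP's $_SERVER superglobal for the user agent):

[pre]
<?php
// Only start a session for human visitors; spiders get no session,
// so no session ID ends up in the URLs they index.
if (!is_bot($_SERVER['HTTP_USER_AGENT'])) {
    session_start();
}
?>
[/pre]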

Any suggestions? This is a little bit of cloaking, so I am posting here :)

<Edit>Realized after posting that there is a forum dedicated to spiders, so moderators, please feel free to move this there if it belongs. The word cloaking was all in my head when I was thinking of this, so I didn't think twice before posting it here. :)</Edit>

Dreamquick
msg:679735
10:14 am on Nov 28, 2002 (gmt 0)

FWIW I like the following; it's pretty effective and rarely turns up false positives (I use it as a check along the lines of "MaybeBot()" for an anti-spam script).

The following are all regular expressions:

"bot"
"robot"
"spider"
"crawler"
"agent"
"validator"

"perl/\d\.\d+"
"python"
"php/\d\.\d"
"java\d\.\d.\d"
"curl/\d\.\d\.\d"
"lwp-request/\d\.\d+"

"^[a-z][a-z]/\d\.\d+$"
"^[a-z]+$"

"@"
"\sat\s"
"http://"

Group 1 should have an obvious origin in that they are all "crawler"-derived words.

Group 2 are commonly seen programming tools (I could add more to this, but these are the main non-malicious ones I see).

Group 3 covers the text-only and/or basic two-letter + version user-agents.

Finally, group 4 is the "got URL/email" filter, which scans for either a URL or an email identifier in the user-agent.

You might also want to consider checking the "From" HTTP header, since any half-decent engine or third-party crawler seems to make sure it is set.
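Put together, a rough PHP sketch of the check; the function name, the combined pattern loop, and the From-header fallback are illustrative, not the exact script I run:

[pre]
<?php
// Rough sketch of a MaybeBot()-style check built from the four groups above.
function maybe_bot($ua, $from = '')
{
    $patterns = array(
        // group 1: crawler-derived words
        'bot', 'robot', 'spider', 'crawler', 'agent', 'validator',
        // group 2: commonly seen programming tools
        'perl/\d\.\d+', 'python', 'php/\d\.\d', 'java\d\.\d\.\d',
        'curl/\d\.\d\.\d', 'lwp-request/\d\.\d+',
        // group 3: text-only and two-letter + version user-agents
        '^[a-z][a-z]/\d\.\d+$', '^[a-z]+$',
        // group 4: "got URL/email" filter
        '@', '\sat\s', 'http://',
    );
    foreach ($patterns as $p) {
        if (preg_match('#' . $p . '#i', $ua)) {
            return true;
        }
    }
    // a populated From: header is another strong crawler hint
    return $from != '';
}

// e.g. maybe_bot($_SERVER['HTTP_USER_AGENT'], $_SERVER['HTTP_FROM'])
?>
[/pre]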

- Tony

DaveAtIFG
msg:679736
3:11 pm on Nov 28, 2002 (gmt 0)

Human visitors don't request a robots.txt file, so when looking for spiders I sort my logs by finding IPs that request it, then extract all subsequent visits from those IPs. Something similar may serve as an effective "pre-filter": if a visitor asks for robots.txt, check it against the spider list; otherwise it's a human visiting.
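In PHP that pre-filter could look something like this very rough sketch (the flat-file path and helper names are made up for illustration):

[pre]
<?php
// In the script that serves robots.txt: remember who asked for it.
function remember_robots_ip($ip)
{
    $fp = fopen('/tmp/robots_ips.txt', 'a');
    if ($fp) {
        fwrite($fp, $ip . "\n");
        fclose($fp);
    }
}

// On normal pages: has this IP ever asked for robots.txt?
function requested_robots($ip)
{
    $ips = @file('/tmp/robots_ips.txt'); // lines keep their trailing \n
    return $ips && in_array($ip . "\n", $ips);
}
?>
[/pre]

If requested_robots() comes back true, check the visitor against the spider list; otherwise treat it as human.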

Dreamquick
msg:679737
11:04 am on Nov 29, 2002 (gmt 0)

Ah yes, but what about those robots which don't request robots.txt?

Also, what about those users who are inquisitive (or annoying) enough to request robots.txt themselves?

Assuming you get a few AOLers doing that, then thanks to the "miracle" (or travesty / abomination) that is their load-balancing proxies, where many different users come through the same proxy IPs, you might find yourself dropping a lot of valid traffic.

- Tony

WebGuerrilla
msg:679738
11:13 am on Nov 29, 2002 (gmt 0)

The other problem with using robots.txt is that even good bots that ask for it don't ask for it on every visit.
Google can hit your site dozens of individual times in a single day but only request the robots.txt file on the first visit.

tomasz
msg:679739
10:18 pm on Dec 2, 2002 (gmt 0)

I found the best way is to store the spiders' IPs in a database.
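For instance, a minimal sketch with MySQL (the table name, column, and helper are made up; assumes a mysql_connect() link is already open):

[pre]
<?php
// Look the visitor's IP up in a table of known spider IPs.
function is_known_spider_ip($ip)
{
    $sql = "SELECT 1 FROM spider_ips WHERE ip = '"
         . mysql_real_escape_string($ip) . "' LIMIT 1";
    $res = mysql_query($sql);
    return $res && mysql_num_rows($res) > 0;
}

if (!is_known_spider_ip($_SERVER['REMOTE_ADDR'])) {
    session_start(); // only humans get a session
}
?>
[/pre]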

Nick_W
msg:679740
10:24 pm on Dec 2, 2002 (gmt 0)

I wrote a very simplistic one for just this thing. Just put a distinctive piece of the UA string in the array and you've got it...

[pre]
<?php
/* Use this to start a session only if the UA is *not* a search engine,
   to avoid duplicate-content issues with URL propagation of SIDs */

$searchengines = array("Google", "Fast", "Slurp", "Ink", "ia_archiver", "Scooter");

$is_search_engine = 0;
foreach ($searchengines as $val) {
    if (strstr($HTTP_USER_AGENT, $val)) {
        $is_search_engine++;
    }
}

if ($is_search_engine == 0) { // Not a search engine

    /* You can put anything in here that needs to be
       hidden from search engines */
    session_start();

} else { // Is a search engine

    /* Put anything you want only for search engines in here */
    $foo = $bar; // placeholder

}

?>
[/pre]

Hope that helps...

Nick
