Home / Forums Index / Marketing and Biz Dev / Cloaking

Cloaking Forum

writing a function for identifying spiders
what will be the algo

10+ Year Member

Msg#: 384 posted 6:22 am on Nov 28, 2002 (gmt 0)

I am about to write a function (in PHP) which identifies visitors to my site as spider or human.
What would be a good algo for that?

The reason I need this is that if it's a spider I will not start a session. If it's a human user I need to track his session.

And the reason for this is that I don't want session variables in my URLs indexed by spiders.

The rudimentary function that I have written is simply this:

function is_bot($HTTP_USER_AGENT)
{
    // true if the string 'bot' appears anywhere in the user agent
    return substr_count($HTTP_USER_AGENT, 'bot') > 0;
}

All this does is check whether the string 'bot' appears in the user agent.

This should take care of Google at least... and some others.

Any suggestions? This is a li'l bit of cloaking, so I am posting here :)

<Edit>Realized after posting that there is a forum dedicated to spiders, so moderators, please feel free to move this there if it belongs there. The word "cloaking" was all in my head when I was thinking of this, so I didn't think twice before posting it here. :)</Edit>
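As an aside, the single-substring check above generalizes naturally to a small list of robot tokens. A minimal sketch of that idea in Python (the thread's own code is PHP; the token list here is an illustrative assumption, not an exhaustive catalogue):

```python
# Match any of a few common robot tokens, case-insensitively.
BOT_TOKENS = ("bot", "crawler", "spider", "slurp")

def is_bot(user_agent: str) -> bool:
    ua = user_agent.lower()
    return any(token in ua for token in BOT_TOKENS)
```

Lower-casing once up front keeps the check case-insensitive without needing a separate entry for "Googlebot" versus "googlebot".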



WebmasterWorld Senior Member 10+ Year Member

Msg#: 384 posted 10:14 am on Nov 28, 2002 (gmt 0)

FWIW I like the following; it's pretty effective and rarely turns up false positives (I use it as a check along the lines of "MaybeBot()" for an anti-spam script).

The following are all regular expressions:





Group 1 should have an obvious origin in that they are all "crawler"-derived words.

Group 2 covers programming tools which are commonly seen (I could add more to this, but these are the main non-malicious ones I see).

Group 3 covers the text-only and/or basic two-letter + version user agents.

Finally, group 4 contains the "got URL/e-mail" filter, which scans for either a URL or an e-mail identifier in the user agent.

You might also want to consider checking the "From" HTTP header, since any half-decent engine or third-party crawler seems to ensure it is set.
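The actual regex lists did not survive in this archived copy of the post, so the patterns below are illustrative stand-ins written to match the four group descriptions (crawler-derived words, programming tools, terse two-letter-plus-version UAs, and URL/e-mail in the UA) — not the original lists. A Python sketch of a "MaybeBot()"-style check along these lines:

```python
import re

# Illustrative patterns only; the original post's regex lists were lost.
GROUPS = [
    re.compile(r"bot|crawl|spider|slurp", re.I),            # group 1: crawler words
    re.compile(r"libwww|lwp|curl|wget|python|java", re.I),  # group 2: programming tools
    re.compile(r"^[a-z]{2}[ /]?\d+(\.\d+)*$", re.I),        # group 3: terse 2-letter + version UAs
    re.compile(r"https?://|www\.|@[\w.-]+\.\w+", re.I),     # group 4: URL or e-mail in the UA
]

def maybe_bot(user_agent: str, from_header: str = "") -> bool:
    if any(g.search(user_agent) for g in GROUPS):
        return True
    # Per the note above: many legitimate crawlers also set the From: header.
    return bool(from_header)
```

Treating a set "From:" header as a bot signal mirrors the suggestion in the post: browsers essentially never send it, while well-behaved crawlers often do.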

- Tony


WebmasterWorld Senior Member 10+ Year Member

Msg#: 384 posted 3:11 pm on Nov 28, 2002 (gmt 0)

Human visitors don't request a robots.txt file, so when looking for spiders I sort my logs by identifying an IP that requests it, then extract all subsequent visits from that IP. Something similar may serve as an effective "pre-filter": if a visitor asks for robots.txt, check the spider list; else it's a human visiting.
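The log pre-filter described above can be sketched in a few lines of Python. Log entries are simplified here to (ip, path) pairs for illustration:

```python
# Any IP that fetched /robots.txt is treated as a spider, and all of its
# hits (before and after the robots.txt request) are pulled out together.
def split_by_robots_txt(log_lines):
    spider_ips = {ip for ip, path in log_lines if path == "/robots.txt"}
    spiders = [entry for entry in log_lines if entry[0] in spider_ips]
    humans = [entry for entry in log_lines if entry[0] not in spider_ips]
    return spiders, humans
```

Building the IP set first, then partitioning, handles the common case where a crawler's robots.txt request is not the first line for that IP in the log.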


WebmasterWorld Senior Member 10+ Year Member

Msg#: 384 posted 11:04 am on Nov 29, 2002 (gmt 0)

Ah yes, but what about those robots which don't request robots.txt?

Also, what about those users who are inquisitive / annoying enough to request robots.txt?

Assuming you get a few AOLers doing that, thanks to the "miracle" (or travesty / abomination) that is their load-balancing proxies, you might find yourself dropping a lot of valid traffic.

- Tony


WebmasterWorld Senior Member 10+ Year Member

Msg#: 384 posted 11:13 am on Nov 29, 2002 (gmt 0)

The other problem with using robots.txt is that even good bots that ask for it don't ask for it on every visit.
Google can hit your site dozens of individual times in a single day but request the robots.txt file only on the first visit.


10+ Year Member

Msg#: 384 posted 10:18 pm on Dec 2, 2002 (gmt 0)

I found the best way is to store the IPs in a database.
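The post doesn't spell out the schema, so here is one hedged sketch of the idea in Python with SQLite: keep known spider IPs in a table and check each visitor against it. Table and column names are assumptions for illustration.

```python
import sqlite3

# Hypothetical one-column table of known spider IPs.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE spider_ips (ip TEXT PRIMARY KEY)")
conn.execute("INSERT INTO spider_ips VALUES ('66.249.64.1')")  # e.g. a Googlebot address

def is_known_spider(ip):
    row = conn.execute("SELECT 1 FROM spider_ips WHERE ip = ?", (ip,)).fetchone()
    return row is not None
```

An IP table like this is typically seeded from log analysis (for example, the robots.txt pre-filter discussed earlier in the thread) and consulted before deciding whether to start a session.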


WebmasterWorld Senior Member nick_w is a WebmasterWorld Top Contributor of All Time 10+ Year Member

Msg#: 384 posted 10:24 pm on Dec 2, 2002 (gmt 0)

I wrote a very simplistic one for just this thing. Just put something from the UA string in the array and you've got it...

/* Use this to start a session only if the UA is *not* a search engine,
   to avoid duplicate content issues with URL propagation of SIDs */

$searchengines = array("Google", "Fast", "Slurp", "Ink", "ia_archiver", "Scooter");

$is_search_engine = 0;
foreach ($searchengines as $val) {
    if (strstr($HTTP_USER_AGENT, $val)) {
        $is_search_engine = 1;
        break;
    }
}

if ($is_search_engine == 0) { // Not a search engine

    /* You can put anything in here that needs to be
       hidden from search engines */

} else { // Is a search engine

    /* Put anything you want only for search engines in here */

}

Hope that helps...


All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved