Forum Moderators: coopster


How do you know that something is NOT a spider?

UA of "Mozilla"?


fish_eye

1:51 pm on Jul 21, 2005 (gmt 0)

10+ Year Member



I would like spiders to find a particular php page but... within the page an email is sent to advise me it's been accessed.

I'd like to NOT get the email every time the page is crawled, so it was suggested I list all the bots I know.

I thought it would be simpler to send the email when the UA starts with "Mozilla". Is it that simple or are there other starts to the UA string for browsers?

What about WAP etc?

(And am I in the correct forum?)

Thanks....

encyclo

3:50 pm on Jul 21, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Quite a few spiders (including one Google one) have a user agent string which starts with "Mozilla". On the other hand, some browsers (such as Opera when not spoofing as IE) do not start with "Mozilla". As such, the user agent string is a very unreliable way of deciding whether the visitor is a bot or not.

DanA

4:22 pm on Jul 21, 2005 (gmt 0)

10+ Year Member



But you can decide that if you find the strings Google, Yahoo, bot, spider or crawler in the user agent, you don't need to send an email.
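A minimal sketch of that idea (the function name is illustrative, and the marker list is just the one suggested above, not exhaustive):

```php
<?php
// Return true if the user agent contains one of the common bot markers.
function looks_like_bot($ua) {
    // Case-insensitive substring match against the suggested markers.
    return (bool) preg_match('/Google|Yahoo|bot|spider|crawler/i', $ua);
}

$ua = isset($_SERVER['HTTP_USER_AGENT']) ? $_SERVER['HTTP_USER_AGENT'] : '';
if (!looks_like_bot($ua)) {
    // No bot marker found: safe to send the notification email here.
}
```

Note that this is only as good as the marker list; a bot announcing itself with an unusual UA will slip through.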

fish_eye

12:02 am on Jul 22, 2005 (gmt 0)

10+ Year Member



Some context:

(Firstly - forget about emails - assume I'm getting too many hits now!)

I want to log details about searches performed on my site in a database (in real time). Searches are done in real time against my files (not an index) using a php program. (I don't think I'm allowed to post the name / author / location of the php source but sticky me if you want to know.)

I have a small search box (form) located on every page.

The search box redirects to a results page.

The results page stores the query string and the URI they've come from (which of my pages they were on).

The results page has context ads on it. As a consequence, whenever a media bot hits the page I get a row stored in my database. There are usually a few media bot hits per search.

I would like to stop these rows being stored.

The obvious short term solution is to simply identify the media bots via the UA (they're all from trusted sources) and not store a record when it's them.

However, I've got tons of other bots also hitting my site and, as a result, tons of rows from this.

If I were to set up this spider trap [webmasterworld.com] I would get rid of most of the bad bots (it seems). (The spider trap uses php to modify my .htaccess file.)

This will get rid of most of the garbage but I'm still left with a significant number of rows from good bots.

I'm thinking that I could adapt the spider trap to also record (in another file) the good bots. My search results page would then read this and not store a row (or would store a different row) for these hits.

Whadja reckon?

grandpa

3:35 am on Jul 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Hi fish_eye

I use an include file where I create an array of known bots. I also use a class, phpSniff, which returns the User Agent string. Then I can compare the User Agent to my known bot array. Here's a shortened version of my include file, checkbot.inc.

<?php
$scriptname = 'checkbot.inc';
global $my_bot;
global $my_return;

// The array is populated with known bot UA's (as regex patterns).
$alluas = array(
'/\baipbot\b/i',
'/\bAltaVista\b/i',
'/\bYahoo! Slurp\b/i',
'/\bYahoo-MMCrawler\b/i',
'/\bZeus\b/i'
);

// Instantiate phpSniff and read back the user agent it detected.
$mysniff = new phpSniff();
$myua = $mysniff->_browser_info['ua'];

// Compare the UA against each known-bot pattern.
$my_bot = '';
$my_return = false;
foreach ($alluas as $pattern) {
if (preg_match($pattern, $myua)) {
// Strip the '/\b' prefix and '\b/i' suffix to get the bot name.
$my_bot = substr($pattern, 3, strlen($pattern) - 7);
$my_return = true;
break;
}
}
?>

In the main script the value of $my_return is tested. I'm using this to determine if a session should be started or not. You can easily base a decision on what to log. If $my_return is true then a bot has been detected.

The only caveat is that new User Agents must be added to the array. You'll need to locate and install the phpSniff class.
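The main-script side might look like this. The branch is wrapped in a function purely for illustration; in the real script, $my_return and $my_bot would come straight from include('checkbot.inc'):

```php
<?php
// Decide what to do with a hit given the flags set by checkbot.inc.
// Returning a string keeps the decision visible; the real script
// would start a session and/or write a database row instead.
function handle_hit($my_return, $my_bot) {
    if ($my_return) {
        // Known bot: skip the session, don't store a search row.
        return "bot:$my_bot";
    }
    // Presumed human: start the session and log as usual.
    return 'human';
}

// In the real script:
// include('checkbot.inc');
// if (handle_hit($my_return, $my_bot) === 'human') { session_start(); }
```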

Code Sentinel

6:54 am on Jul 22, 2005 (gmt 0)

10+ Year Member



User agents can be set to anything or nothing, and short of putting a check on hits per second I don't think you can distinguish bots from users in any reliable way.

Depending on the weight of the page and a user's connection, you can still get at least 4 hits per second from a single user (4 being the normal maximum number of open connections per IP for most servers), since style sheets, graphics and external JS are all requested immediately after a page is retrieved. This doesn't even take HTTP pipelining into account.

I wish I had so much traffic that bots sucking up more was such a problem :)

grandpa

7:17 am on Jul 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



user agents can be set to anything or nothing

True enough. But in the case of valid search bots or Media bots, they will properly identify themselves. Combine that with a list of known IP ranges for the agents and you get a lot closer to identifying any real bot. Of course, anything can be spoofed, so nothing is foolproof. That's not a good enough reason to sit idle and not try, IMO.
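That IP-range idea can be sketched like this. The CIDR block below is purely an example; real crawler ranges have to be looked up and maintained by hand:

```php
<?php
// Check whether an IP falls inside a CIDR range (IPv4 only).
function ip_in_range($ip, $cidr) {
    list($subnet, $bits) = explode('/', $cidr);
    $mask = -1 << (32 - (int) $bits);        // e.g. /19 -> top 19 bits set
    return (ip2long($ip) & $mask) === (ip2long($subnet) & $mask);
}

// Check an IP against a list of known crawler ranges.
function from_bot_range($ip, $ranges) {
    foreach ($ranges as $cidr) {
        if (ip_in_range($ip, $cidr)) return true;
    }
    return false;
}

// Example range only - maintain your own list of verified crawler IPs.
$known_bot_ranges = array('66.249.64.0/19');
```

Combined with the UA check, an agent claiming to be a known bot but coming from outside its published ranges is a strong spoofing signal.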

fish_eye

2:48 pm on Jul 22, 2005 (gmt 0)

10+ Year Member



anything can be spoofed

Forgive my ignorance, but does that include IPs? If so, doesn't this mean that the aforementioned PHP spider trap [webmasterworld.com] can record a good IP as a bad one (because the spoofed request from the good IP will disobey robots.txt and the IP will therefore go into the banned list)?

How often do baddies bother to spoof IPs? Are they more inclined to simply spoof (or hide etc) the UA?

jatar_k

4:39 pm on Jul 22, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Well, IPs can be spoofed, but it isn't the easiest thing to do, and if someone is going to that much trouble, forget it: these normal little measures won't protect against them anyway.

All legit spiders have known IPs as well as set user agents. grandpa's solution will be fine.

Forget not storing a record: don't even show the search box on the page if it is a bot. Then they can't hit the actual search script at all.

vincevincevince

5:03 pm on Jul 22, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



The easiest method is to check for a referer which is both present and from your own site: check $_SERVER['HTTP_REFERER'].

Use strpos() and check for your domain name - if it's there then it's either a very clever spider, or it's a human.

Or you can change your search to use POST instead of GET. That way you can check for the existence of the $_POST array, and if it's present, it's a proper click from your search box, not a contextual ad check.
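Both checks together might look like this ('example.com' is a placeholder for your own domain):

```php
<?php
// Return true only when the request carries a referer from our own
// site AND arrived as a POST - i.e. it came from the on-site search
// form, not from a bot or an ad-network fetch.
// $server and $post stand in for $_SERVER and $_POST so the check
// is easy to exercise in isolation.
function is_local_search($server, $post) {
    $referer = isset($server['HTTP_REFERER']) ? $server['HTTP_REFERER'] : '';
    $from_our_site = (strpos($referer, 'example.com') !== false);
    $was_posted = !empty($post);
    return $from_our_site && $was_posted;
}

// In the results page:
// if (is_local_search($_SERVER, $_POST)) { /* store the row */ }
```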

fish_eye

4:23 am on Jul 23, 2005 (gmt 0)

10+ Year Member



Vince, Thanks for reading the context of my problem and seeing beyond my suggestion. A nice, simple and far more efficient solution - and when I look at the emails I'm being sent by the search results page it's right in front of my nose!

I still like the spider trap (beats having a list of 100+ bots in my .htaccess - some of which must have gone belly up by now) and may play around with the IPs of good bots as well (to what end I don't know yet :))

Many thanks all.

jatar_k

5:06 am on Jul 23, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



Only problem is that referer isn't that reliable; you'll end up banning users.

fish_eye

8:02 am on Jul 23, 2005 (gmt 0)

10+ Year Member



only problem is that referer isn't that reliable

Surely a local request from one of my own pages is going to be reliable? Do you mean it gets truncated or something? I have seen some odd ones occasionally but not very often.

If you mean it's unreliable if it's from outside then I specifically don't want to store a row for a request from anywhere other than my own pages so all's well.

fish_eye

10:32 am on Jul 23, 2005 (gmt 0)

10+ Year Member



I just discovered the php spider trap I have been referring to has been updated. The original works in most cases but here is the updated version for those with sites having greater volumes of traffic. (There's also a more extensive background explanation including the need to place a link (hidden or otherwise) to the trap :))

[webmasterworld.com...]

jatar_k, just to clarify, I'm not planning to use referer to ban people - simply to avoid saving spurious (non-)search queries hitting my php search program. The thread got a bit confused because I started to think along the lines of a technique similar to the spider trap (and have now very happily abandoned that approach).

jatar_k

4:29 pm on Jul 23, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



>> I'm not planning to use referer to ban people

Yeah, sorry, I used the wrong word there. You will miss rows for some people.

Referer is given to your server by the client; it is reliable about 75% of the time. There are programs that block the referer from being passed, and it can easily be faked.

This doesn't mean not to use it but you have to understand that the accuracy is questionable.

fish_eye

7:28 am on Jul 24, 2005 (gmt 0)

10+ Year Member



it [referer] is reliable about 75% of the time

Really - only that much!?

jatar_k

3:44 pm on Jul 24, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



it's kind of a rough number but it is about right. Some people get less and some get more.

vincevincevince

4:59 pm on Jul 24, 2005 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I personally get much more reliability from the referer. Nearer 99%, once spiders are omitted.

fish_eye

1:06 am on Jul 25, 2005 (gmt 0)

10+ Year Member



I know it's often blank or "-" but I have always assumed, and some quick analysis seems to confirm, that this simply means it's a human coming in from outside (granted, one whose real referer may have been stripped) or it's a bot.

In fact, 40.77% of last month's traffic to the site that I'm currently playing with (without the new improved spider trap :)) was "-" and they mostly seem to be bots.

*~~~~~~~~~~~

I guess the point is (going back to the original theme of this thread):

1) Referer may be a fairly reliable way of determining if the hit on my search results page came from within my site (but not infallible).

2) I should seriously consider using POST (I am using GET) and check the $_POST array as this is likely to be more reliable.

Does that sound right?

jatar_k

5:53 am on Jul 25, 2005 (gmt 0)

WebmasterWorld Administrator 10+ Year Member



yep, sounds right fish_eye