Forum Moderators: coopster
I'd like to NOT get the email every time the page is crawled, so it was suggested I list all the bots I know.
I thought it would be simpler to send the email when the UA starts with "Mozilla". Is it that simple or are there other starts to the UA string for browsers?
What about WAP etc?
(And am I in the correct forum?)
Thanks....
(Firstly - forget about emails - assume I'm getting too many hits now!)
I want to log in a database on my site (in real time) details about searches performed on my site. Searches are done real time against my files (not an index) using a php program. (I don't think I'm allowed to post the name / author / location of the php source but sticky me if you want to know.)
I have a small search box (form) located on every page.
The search box redirects to a results page.
The results page stores the query string and the URI they've come from (which of my pages they were on).
The results page has context ads on it. As a consequence, whenever a media bot hits the page I get a row stored in my database. There are usually a few media bot hits per search.
I would like to stop these rows being stored.
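For context, the logging step on the results page presumably looks something like this. This is only a sketch: the helper name, the table (`searchlog`), and the column names are my assumptions, not the poster's actual code.

```php
<?php
// Build the row to store: the search query plus the page the
// visitor searched from. Hypothetical helper, names assumed.
function build_search_row($query, $referer)
{
    return array(
        'query'   => trim($query),
        'referer' => $referer === null ? '' : $referer,
    );
}

// On the real results page this would feed an INSERT, e.g. with PDO:
//   $stmt = $pdo->prepare(
//       'INSERT INTO searchlog (query, referer) VALUES (:query, :referer)');
//   $stmt->execute(build_search_row($_GET['q'], $_SERVER['HTTP_REFERER']));
```

Every hit on the page, human or bot, triggers that INSERT, which is exactly the problem being described.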
The obvious short term solution is to simply identify the media bots via the UA (they're all from trusted sources) and not store a record when it's them.
However, I've got tons of other bots also hitting my site and, as a result, tons of rows from this.
If I were to set up this spider trap [webmasterworld.com] I would get rid of most of the bad bots (it seems). (The spider trap uses php to modify my .htaccess file.)
This will get rid of most of the garbage but I'm still left with a significant number of rows from good bots.
I'm thinking that I could adapt the spider trap to also record (in another file) the good bots. My search results page would then read this and not store a row (or would store a different row) for these hits.
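That adaptation might look roughly like this. The file name (`goodbots.txt`), its one-substring-per-line format, and the helper name are all assumptions; the spider trap itself would be responsible for maintaining the file.

```php
<?php
// Sketch: check a hit against a file of known good bots.
// Return true when $ua contains any of the substrings in $goodbots.
function ua_in_goodbots($ua, $goodbots)
{
    foreach ($goodbots as $bot) {
        $bot = trim($bot);   // strip the trailing newline from file()
        if ($bot !== '' && stripos($ua, $bot) !== false) {
            return true;
        }
    }
    return false;
}

// On the results page:
//   $goodbots = file('goodbots.txt');   // list maintained by the trap
//   if (!ua_in_goodbots($_SERVER['HTTP_USER_AGENT'], $goodbots)) {
//       // ...store the row...
//   }
```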
Whadja reckon?
I use an include file where I create an array of known bots. I also use a class, phpSniff, which returns the User Agent string. Then I can compare the User Agent to my known bot array. Here's a shortened version of my include file, checkbot.inc.
<?php
// checkbot.inc -- sets $my_bot and $my_return for the calling script.
$scriptname = 'checkbot.inc';
global $my_bot;
global $my_return;
$my_bot = '';

// phpSniff must be loaded and instantiated first
// (adjust the path to wherever your copy of the class lives).
require_once('phpSniff.class.php');
$mysniff = new phpSniff();

// The array is populated with known bot UA patterns.
$alluas = array();
$alluas[] = '/\baipbot\b/i';
$alluas[] = '/\bAltaVista\b/i';
$alluas[] = '/\bYahoo! Slurp\b/i';
$alluas[] = '/\bYahoo-MMCrawler\b/i';
$alluas[] = '/\bZeus\b/i';

// phpSniff stores the raw User Agent string in $_browser_info['ua'].
$myua = $mysniff->_browser_info['ua'];

// Compare the UA against every pattern. On a match, strip the
// pattern's delimiters (the 3 leading characters "/\b" and the
// 4 trailing characters "\b/i") to recover the bare bot name.
$my_test = '';
foreach ($alluas as $value) {
    if (preg_match($value, $myua)) {
        $my_bot  = substr($value, 3, strlen($value) - 7);
        $my_test = $value;
    }
}

// Recursive in_array() that also copes with nested arrays.
function in_array_multi($needle, $haystack)
{
    if (!is_array($haystack)) return $needle == $haystack;
    foreach ($haystack as $value) {
        if (in_array_multi($needle, $value)) return true;
    }
    return false;
}

// True when the UA matched one of the bot patterns above.
$my_return = in_array_multi($my_test, $alluas);
?>
In the main script the value of $my_return is tested. I'm using this to determine if a session should be started or not. You can easily base a decision on what to log. If $my_return is true then a bot has been detected.
The only caveat is that new User Agents must be added to the array. You'll need to locate and install the phpSniff class.
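To illustrate how that decision might hang together in the main script: this is only a sketch, and the helper name `should_log()` is mine, not part of checkbot.inc.

```php
<?php
// If $my_return is true a bot was detected, so skip the
// session/logging work. Hypothetical helper, name is my own.
function should_log($my_return)
{
    // Log (and start a session) only for non-bot visitors.
    return !$my_return;
}

// In the real page:
//   include('checkbot.inc');       // sets $my_bot and $my_return
//   if (should_log($my_return)) {
//       session_start();
//       // ...store the search row...
//   }
```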
Depending on the weight of the page and a user's connection, you can still get at least 4 hits per second from a single user (four being the normal maximum number of open connections at once per IP for most servers): style sheets, graphics, and external JS are all requested immediately after a page is retrieved. This doesn't even take HTTP pipelining into account.
I wish I had so much traffic that bots sucking up more was such a problem :)
user agents can be set to anything or nothing
anything can be spoofed
Forgive my ignorance, but does that include IPs? If so, doesn't this mean that the aforementioned PHP spider trap [webmasterworld.com] can record a good IP as being a bad one (because the spoof of the good IP will disobey robots.txt and therefore go into the banned list)?
How often do baddies bother to spoof IPs? Are they more inclined to simply spoof (or hide etc) the UA?
all legit spiders also have known ips as well as set user agents. grandpa's solution will be fine.
forget not showing a record, don't even show the search box on the page if it is a bot, then they never hit the actual search script at all
Use strpos() and check for your domain name - if it's there then it's either a very clever spider, or it's a human.
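A minimal sketch of that strpos() check, treating the hit as internal when the referer contains your domain name. The helper name and the example domain are my own.

```php
<?php
// Hypothetical helper: is this hit from one of our own pages?
function is_internal_referer($referer, $domain)
{
    // No referer at all (bots, privacy tools) counts as external.
    if ($referer === null || $referer === '') {
        return false;
    }
    // strpos() returns the match position or false; note the
    // strict !== comparison (position 0 is a valid match).
    return strpos($referer, $domain) !== false;
}

// On the results page (HTTP_REFERER may be unset, so check first):
//   $ref = isset($_SERVER['HTTP_REFERER']) ? $_SERVER['HTTP_REFERER'] : '';
//   if (is_internal_referer($ref, 'example.com')) { /* store the row */ }
```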
Or - you can change your search to use POST not GET. That way you can check for the existence of the $_POST array - and if present, it's a proper click from your search box, not a contextual ad check.
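Sketched out, the POST approach looks like this: $_POST is non-empty only for a real form submission, never for a contextual-ad (GET) fetch of the page. The field name `q` and the helper name are assumptions.

```php
<?php
// Hypothetical helper: was this request a real search submission?
function is_search_submission($post)
{
    // A GET fetch of the page leaves $_POST empty.
    return !empty($post) && isset($post['q']);
}

// The form:  <form method="post" action="/results.php">
//              <input type="text" name="q">
//            </form>
// The page:  if (is_search_submission($_POST)) { /* store the row */ }
```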
I still like the spider trap (beats having a list of 100+ bots in my .htaccess - some of which must have gone belly up by now) and may play around with the IPs of good bots as well (to what end I don't know yet :))
Many thanks all.
only problem is that referer isn't that reliable
Surely a local request from one of my own pages is going to be reliable? Do you mean it gets truncated or something? I have seen some odd ones occasionally but not very often.
If you mean it's unreliable if it's from outside then I specifically don't want to store a row for a request from anywhere other than my own pages so all's well.
[webmasterworld.com...]
jatar_k, just to clarify, I'm not planning to use referer to ban people - simple to avoid saving spurious (non-)search queries hitting my php search program. The threads got a bit confused because I started to think along the lines of a technique similar to the spider trap (and have now very happily abandoned that approach).
yeah sorry, I used the wrong word there. You will still miss lines for some people, though.
The referer is given to your server by the client; it's reliable maybe 75% of the time. There are programs that block the referer from being passed, and it can easily be faked.
This doesn't mean not to use it but you have to understand that the accuracy is questionable.
In fact, 40.77% of last month's traffic to the site that I'm currently playing with (without the new improved spider trap :)) had a referer of "-", and those hits mostly seem to be bots.
I guess the point is (going back to the original theme of this thread):
1) Referer may be a fairly reliable way of determining if the hit on my search results page came from within my site (but not infallible).
2) I should seriously consider using POST (I am using GET) and check the $_POST array as this is likely to be more reliable.
Does that sound right?