Forum Moderators: coopster & jatar k

Spiders messing with my stats

3:55 pm on Nov 4, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 15, 2004
posts:192
votes: 0


Hi all,

I'm tracking the number of times a page has been viewed by my visitors, using the following code:

<?php
$ip = $_SERVER["REMOTE_ADDR"];
$write_rec = 'Y'; // default: count the view

if ($ip == '66.249.66.242') {
    $write_rec = 'N'; // known spider IP, so skip it
}

if ($write_rec != 'N') {
    // ** Write record to database **
}
?>

The '66.249.66.242' IP is one of Google's spiders. Obviously this is a bad way of doing things because I would need the IP of every spider to successfully differentiate between a human visitor and a spider.

What's the best way of doing this?

chirp

4:06 pm on Nov 4, 2005 (gmt 0)

Inactive Member
Account Expired

 
 


Something like this?

if(!preg_match("/Googlebot/", $_SERVER['HTTP_USER_AGENT'])) { 
# not Googlebot
}

;)

4:18 pm on Nov 4, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 15, 2004
posts:192
votes: 0


Hi,

Thanks for the reply. What if it's the Yahoo bot, MSN, etc.? Am I then back in a similar situation, or is this the best I can do?

4:30 pm on Nov 4, 2005 (gmt 0)

Administrator

WebmasterWorld Administrator jatar_k is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 24, 2001
posts:15755
votes: 0


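you could have an array of bot names then compare the user agent to the array. note that in_array() only matches whole values, so against a full user agent string you would loop over the array and test each name with a substring check such as stripos()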
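
A minimal sketch of the bot-name array idea, under a few assumptions: the three names in `$bot_names` are only illustrative, and the helper `is_known_bot()` is a made-up name, not something from the thread. It loops with stripos() rather than calling in_array(), because in_array() compares whole values and a real user agent string contains far more than the bot name.

```php
<?php
// Illustrative list of bot name fragments (an assumption); extend it from your own logs.
$bot_names = ['Googlebot', 'Slurp', 'msnbot'];

// Hypothetical helper: true when the user agent contains any listed name.
function is_known_bot(string $userAgent, array $botNames): bool
{
    foreach ($botNames as $name) {
        // stripos() is a case-insensitive substring search
        if (stripos($userAgent, $name) !== false) {
            return true;
        }
    }
    return false;
}

// The header may be missing, so fall back to an empty string.
$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (!is_known_bot($ua, $bot_names)) {
    // ** Write record to database **
}
```

A visitor that sends no user agent at all is counted as human here; whether that is what you want is a judgment call.
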
4:40 pm on Nov 4, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 15, 2004
posts:192
votes: 0


Would it be safe to check the HTTP_USER_AGENT and not write a record if any of the following are found:

crawl, bot, slurp, spider, seek, collect, track

I've just been looking through a list of HTTP_USER_AGENT values and I think these may work, though I realise this isn't going to be perfect.
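
For what it's worth, the keyword check could be sketched as a single case-insensitive pattern. The function name `looks_like_bot()` and the constant are hypothetical, and treating an empty user agent as a bot is an extra assumption that is not part of the keyword list.

```php
<?php
// The keywords from the list above, joined into one case-insensitive pattern.
const BOT_KEYWORDS_PATTERN = '/crawl|bot|slurp|spider|seek|collect|track/i';

// Hypothetical helper: true when the user agent looks like a spider.
function looks_like_bot(string $userAgent): bool
{
    // Assumption: a missing or empty user agent is treated as a bot.
    if ($userAgent === '') {
        return true;
    }
    return preg_match(BOT_KEYWORDS_PATTERN, $userAgent) === 1;
}

$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
if (!looks_like_bot($ua)) {
    // ** Write record to database **
}
```

Short keywords such as "bot" or "seek" match anywhere in the string, so it is worth checking the pattern against real user agents from your logs before trusting the counts.
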

4:44 pm on Nov 4, 2005 (gmt 0)

Administrator

WebmasterWorld Administrator jatar_k is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 24, 2001
posts:15755
votes: 0


could work, seems roughly ok

using user agent is never an exact science

though this makes me think, why are you doing this? Is it to be displayed on the page?

or is it some kind of tracking? if tracking then you would be much better off using a stats package and your raw logs.

5:18 pm on Nov 4, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Mar 15, 2004
posts:192
votes: 0


Hi,

It is for tracking but only for my reference. My site is built from a database, so for example I have the following in the database:

widget0001 ¦ widgetinformation ¦ view count
widget0002 ¦ widgetinformation ¦ view count
...
widget9323 ¦ widgetinformation ¦ view count

I have a stats package but I'd like this count just so that I know which is the most popular page when I'm reading my raw data.

5:30 pm on Nov 4, 2005 (gmt 0)

Administrator

WebmasterWorld Administrator jatar_k is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:July 24, 2001
posts:15755
votes: 0


you could do some baseline work

insert for each hit and then see what needs to be filtered once you have enough base data

use a table that stores

ip
pagename
user agent

then you can select counts from mysql and start excluding the bots you find in there. you could also maintain a botlist to load into your array for comparison, and tighten up your pageview counting over time
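
A rough sketch of that baseline approach using PDO, under stated assumptions: the table name `hits`, its column names and sizes, and the functions `log_hit()` and `count_views()` are all invented for illustration. Every hit is stored unfiltered, and the bot filtering happens only when the counts are selected.

```php
<?php
// Hypothetical table, assumed to exist already:
//   CREATE TABLE hits (ip VARCHAR(45), pagename VARCHAR(255), user_agent VARCHAR(255));

// Store one raw hit; nothing is filtered at insert time.
function log_hit(PDO $db, string $ip, string $pagename, string $userAgent): void
{
    $stmt = $db->prepare('INSERT INTO hits (ip, pagename, user_agent) VALUES (?, ?, ?)');
    $stmt->execute([$ip, $pagename, $userAgent]);
}

// Count views per page, skipping hits whose user agent contains a listed fragment.
// Returns an array of pagename => view count, most viewed first.
function count_views(PDO $db, array $botFragments): array
{
    $sql = 'SELECT pagename, COUNT(*) AS views FROM hits';
    $params = [];
    if ($botFragments) {
        $conds = [];
        foreach ($botFragments as $fragment) {
            $conds[] = 'user_agent NOT LIKE ?';
            $params[] = '%' . $fragment . '%';
        }
        $sql .= ' WHERE ' . implode(' AND ', $conds);
    }
    $sql .= ' GROUP BY pagename ORDER BY views DESC';
    $stmt = $db->prepare($sql);
    $stmt->execute($params);
    return $stmt->fetchAll(PDO::FETCH_KEY_PAIR);
}
```

Calling `count_views($db, [])` gives the unfiltered baseline, and `count_views($db, ['bot', 'slurp'])` shows what remains once those fragments are excluded. Fragments containing `%` or `_` would act as LIKE wildcards, so keep the botlist to plain text.
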

9:59 am on Nov 5, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 11, 2003
posts:71
votes: 0


Why not filter positive? There are only a few browsers out there, but many bots. So go for

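$ua = $_SERVER['HTTP_USER_AGENT'] ?? '';
// Opera or IE (which identifies itself as "MSIE") or Firefox or Safari or ..
if (preg_match('/Opera|MSIE|Firefox|Safari/i', $ua)) {
    // LogIt
} else {
    // NoLog
}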

10:03 am on Nov 5, 2005 (gmt 0)

Junior Member

10+ Year Member

joined:Jan 11, 2003
posts:71
votes: 0


I have a stats package but I'd like this count just so that I know which is the most popular page when I'm reading my raw data.

Anyway, I suggest not aiming for perfection where stats are concerned. Even if you don't filter out the bots, the counts will probably be accurate enough to answer your question above.