Spiders messing with my stats

PumpkinHead

3:55 pm on Nov 4, 2005 (gmt 0)

Hi all,

I'm tracking the number of times a page has been viewed by my visitors, using the following code:

<?php
$ip = $_SERVER["REMOTE_ADDR"];
$write_rec = 'Y';

// 66.249.66.242 is one of Google's spiders
if ($ip == '66.249.66.242') {
    $write_rec = 'N';
}

if ($write_rec != 'N') {
    // ** Write record to database **
}
?>

The IP 66.249.66.242 belongs to one of Google's spiders. Obviously this is a bad way of doing things, because I would need the IP of every spider to reliably tell a human visitor from a spider.

What's the best way of doing this?

chirp

4:06 pm on Nov 4, 2005 (gmt 0)

Something like this?

if(!preg_match("/Googlebot/", $_SERVER['HTTP_USER_AGENT'])) { 
# not Googlebot
}

;)

PumpkinHead

4:18 pm on Nov 4, 2005 (gmt 0)

Hi,

Thanks for the reply. What if it's the Yahoo bot, MSN, and so on? Then I'm back in a similar situation. Or is this the best I can do?

jatar_k

4:30 pm on Nov 4, 2005 (gmt 0)

You could have an array of bot names, then compare the user agent against the array using in_array(); a sketch is below.
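
A minimal sketch of that suggestion. The user-agent strings in the array are only illustrative, and in_array() does an exact comparison, so a real list would have to be collected from your own logs:

<?php
// Illustrative bot user-agent strings; a real list would come from your logs
$bot_agents = array(
    'Googlebot/2.1 (+http://www.google.com/bot.html)',
    'Mozilla/5.0 (compatible; Yahoo! Slurp; http://help.yahoo.com/help/us/ysearch/slurp)',
);

// in_array() only catches exact matches, so any variation in the
// user-agent string slips through
if (!in_array($_SERVER['HTTP_USER_AGENT'], $bot_agents)) {
    // not a known bot: write the record
}
?>

Substring matching, as in the preg_match example above, is usually more practical, since full user-agent strings carry version numbers that change.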

PumpkinHead

4:40 pm on Nov 4, 2005 (gmt 0)

Would it be safe to check the HTTP_USER_AGENT and not write a record if any of the following are found:

crawl, bot, slurp, spider, seek, collect, track

I've just been looking through a list of HTTP_USER_AGENT strings and I think these may work. I realise this isn't going to be perfect, though.
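
Something like this, assuming that keyword list (note a bare substring match on "bot" can occasionally catch a legitimate user agent, so it may over-filter slightly):

<?php
// Case-insensitive check against the keyword list above
$bot_keywords = '/crawl|bot|slurp|spider|seek|collect|track/i';

if (!preg_match($bot_keywords, $_SERVER['HTTP_USER_AGENT'])) {
    // no bot keyword found: count this as a human page view
}
?>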

jatar_k

4:44 pm on Nov 4, 2005 (gmt 0)

Could work; seems roughly OK.

Using the user agent is never an exact science.

Though this makes me think: why are you doing this? Is it to be displayed on the page?

Or is it some kind of tracking? If it's tracking, you would be much better off using a stats package and your raw logs.

PumpkinHead

5:18 pm on Nov 4, 2005 (gmt 0)

Hi,

It is for tracking but only for my reference. My site is built from a database, so for example I have the following in the database:

widget0001 ¦ widgetinformation ¦ view count
widget0002 ¦ widgetinformation ¦ view count
...
widget9323 ¦ widgetinformation ¦ view count

I have a stats package but I'd like this count just so that I know which is the most popular page when I'm reading my raw data.
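
With a table like that, each page view is a single counter update. A sketch, with hypothetical table and column names (widgets, widget_id, view_count) standing in for the real schema:

<?php
// Hypothetical schema: widgets(widget_id, widget_information, view_count)
$widget_id = mysql_real_escape_string($_GET['id']); // assumed page parameter

mysql_query("UPDATE widgets
             SET view_count = view_count + 1
             WHERE widget_id = '$widget_id'");
?>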

jatar_k

5:30 pm on Nov 4, 2005 (gmt 0)

You could do some baseline work.

Insert a row for each hit, then see what needs to be filtered once you have enough base data.

Use a table that stores:

ip
pagename
user agent

Then you can select counts from MySQL and start excluding the bots you find in there. You could also maintain a bot list to load into your array for comparison, and tighten up your page-view counting over time. A sketch is below.
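
A sketch of that baseline logging, using a hypothetical hit_log table; the commented query at the end shows one way to spot bots by how many hits each user agent accounts for:

<?php
// Hypothetical table: hit_log(ip, pagename, user_agent)
mysql_query(sprintf(
    "INSERT INTO hit_log (ip, pagename, user_agent) VALUES ('%s', '%s', '%s')",
    mysql_real_escape_string($_SERVER['REMOTE_ADDR']),
    mysql_real_escape_string($_SERVER['PHP_SELF']),
    mysql_real_escape_string($_SERVER['HTTP_USER_AGENT'])
));

// Later, see which user agents dominate and add the bots to your exclude list:
// SELECT user_agent, COUNT(*) AS hits
// FROM hit_log
// GROUP BY user_agent
// ORDER BY hits DESC;
?>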

AcsCh

9:59 am on Nov 5, 2005 (gmt 0)

Why not filter positively? There are only a few browsers out there, but many bots. So go for something like:

// Positive filter: only log user agents that look like real browsers
// (IE identifies itself as "MSIE"; extend the list as needed)
if (preg_match('/Opera|MSIE|Firefox|Safari/i', $_SERVER['HTTP_USER_AGENT'])) {
    // LogIt
} else {
    // NoLog
}

AcsCh

10:03 am on Nov 5, 2005 (gmt 0)

I have a stats package but I'd like this count just so that I know which is the most popular page when I'm reading my raw data.

Anyway, I suggest not going for perfection where stats are concerned. Even ignoring the bot issue, the stats will probably be accurate enough to answer that question.
 
