Forum Moderators: DixonJones


What do you filter out?

Let's compile a list of everything you should filter out in your log files

         

Mikkel Svendsen

10:27 am on Sep 15, 2002 (gmt 0)

10+ Year Member



I often stumble across new things that deserve to be filtered out: noise that clutters my web server log file analysis. I would like to hear what you filter out to get down to the "real" numbers ...

This is just a short list with some of the things, to start out with ...

  • Your own IP(s)
  • IPs of (your own) externally requesting systems or personnel
  • Spiders (if you are doing visitor analysis)
  • Non-human surfers (if you are doing visitor analysis, whichever way you choose to do it … Sometimes I simply filter out all hits with no referrer, as that will filter out most agents and spiders)
  • All hits to graphic files (unless you are doing server load balancing analysis and that sort of stuff)
  • If you are using Google's hosted site search you probably want to filter out searches from that referrer if you are analysing search engine traffic
  • Hits to administrative files or test versions of the site
  • Hits to redirect files on your server (often show up as two page views if you don't filter out the redirecting file)
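
As a rough illustration of the list above, here is a minimal Python sketch of such a filter. It is not taken from any particular log analyser: the `Hit` record, the IP addresses, the path prefixes, the redirect script name and the spider markers are all hypothetical placeholders that you would replace with values from your own site.

```python
import re
from typing import NamedTuple


class Hit(NamedTuple):
    """One parsed log line (hypothetical structure for this sketch)."""
    ip: str
    path: str
    referrer: str
    user_agent: str


# All values below are made-up examples, not recommendations.
OWN_IPS = {"192.0.2.10", "192.0.2.11"}           # your own IPs and systems
GRAPHIC_RE = re.compile(r"\.(gif|jpe?g|png)$", re.IGNORECASE)
ADMIN_PREFIXES = ("/admin/", "/test/")           # admin files and test versions
REDIRECT_PATHS = {"/go.php"}                     # redirect files on the server
SPIDER_MARKERS = ("bot", "crawl", "spider", "slurp")


def keep_hit(hit: Hit, require_referrer: bool = False) -> bool:
    """Return True if the hit should count towards visitor analysis."""
    if hit.ip in OWN_IPS:
        return False
    path = hit.path.split("?", 1)[0]             # ignore the query string
    if GRAPHIC_RE.search(path):
        return False
    if path.startswith(ADMIN_PREFIXES) or path in REDIRECT_PATHS:
        return False
    agent = hit.user_agent.lower()
    if any(marker in agent for marker in SPIDER_MARKERS):
        return False
    # Optional and blunt: this also drops type-ins and bookmarks.
    if require_referrer and hit.referrer in ("", "-"):
        return False
    return True
```

The `require_referrer` switch is off by default because the no-referrer rule is the most aggressive item on the list; turn it on only when you accept losing direct visits.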

    … Then come shopping cart systems, dynamic pages and especially websites using website IDs – but that’s another chapter :)

    Often when I work with clients that have done their own analysis for some time, I find that they have not been filtering out what they should and that they have meddled with the standard session timeout settings. I've seen cases in which they turned the session timeout down to 5 minutes – and I tell you, they had a lot of user sessions :) However, they were very disappointed once I was done and they had to report the "real" numbers to the boss …

    Do you often have this problem too? And how do you deal with clients when you have to tell them that they only have half the visitors they thought?
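
To make the session timeout effect concrete, here is a small Python sketch. It assumes the usual rule that a new session starts whenever the gap since the visitor's previous hit exceeds the timeout; the 30-minute value stands in for a typical default and the timestamps are invented for the example.

```python
from datetime import datetime, timedelta


def count_sessions(timestamps, timeout):
    """Count sessions for ONE visitor.

    A new session begins when the gap since the previous hit is longer
    than `timeout` (assumed rule; check how your own tool defines it).
    """
    sessions = 0
    previous = None
    for ts in sorted(timestamps):
        if previous is None or ts - previous > timeout:
            sessions += 1
        previous = ts
    return sessions


# Invented example: one visitor reading slowly, hits at 0, 3, 12, 20, 28 minutes.
start = datetime(2002, 9, 15, 10, 0)
hits = [start + timedelta(minutes=m) for m in (0, 3, 12, 20, 28)]

with_5_min = count_sessions(hits, timedelta(minutes=5))     # 4 sessions
with_30_min = count_sessions(hits, timedelta(minutes=30))   # 1 session
```

The same visitor and the same five hits give four "sessions" with a 5-minute timeout and one with a 30-minute timeout, which is the kind of inflation described above.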

    Ash

    9:47 am on Sep 23, 2002 (gmt 0)

    10+ Year Member



    Great post, Mikkel. I'm just getting into stats at the moment, and it seems to me that there are questions we need to ask our clients and then try to answer through includes and excludes within our particular stats programs. I have read on the board that one person offers clients a list of questions to choose from and then sets about answering the ones they have picked. This seems to be a good way to offer stats analysis. I also think that what we include and exclude will be different for each individual client and business area.

    lazerzubb

    9:49 am on Sep 23, 2002 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    All the different attacks on the server.
    Sometimes a huge number of different types appear in the log files.

    agentwebranking

    9:58 am on Sep 23, 2002 (gmt 0)

    10+ Year Member



    Hi Mikkel,

    What do you mean by:
    "If you are using Google's hosted site search you probably want to filter out searches from that referrer if you are analysing search engine traffic"

    Agentwebranking

    tedster

    10:26 am on Sep 23, 2002 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    If you use framesets in your site, you may want to filter out the purely navigational frames and only count true content frames. This is easiest if you use a naming convention so you can use wildcards in the filter.

    I have one client whose "true" page views were inflated by about 25% by counting all those hits to framed navigation pages.
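
A naming convention plus wildcards could look something like the Python sketch below. The patterns `/nav_*.htm` and `/frames/*` are purely hypothetical examples of a convention, not anything from a real site.

```python
from fnmatch import fnmatchcase

# Hypothetical convention: navigation frames are named nav_*.htm
# or live under /frames/. Adjust to whatever your site actually uses.
NAV_FRAME_PATTERNS = ["/nav_*.htm", "/frames/*"]


def is_content_page(path):
    """True if the path is NOT one of the navigation-frame patterns."""
    return not any(fnmatchcase(path, pattern) for pattern in NAV_FRAME_PATTERNS)


paths = ["/index.htm", "/nav_left.htm", "/frames/top.htm", "/products.htm"]
content_views = [p for p in paths if is_content_page(p)]
```

Only `/index.htm` and `/products.htm` survive here, so the navigation frames no longer inflate the page view count.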

    agentwebranking -
    The problem with a Google-hosted site search is that your own website's "Site Search" hits may show a "google" referer (including a search phrase), but that is not true search engine traffic - the visitor was already on your site.
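
One way to separate the two kinds of "google" referer is sketched below in Python. It rests on an assumption you should verify against your own logs: that referers from the hosted site search carry your domain in a query parameter such as `sitesearch` or `domains`. The domain `example.com` is a placeholder.

```python
from urllib.parse import parse_qs, urlsplit

OWN_DOMAIN = "example.com"  # placeholder for your own site

# Assumed parameter names; confirm what actually appears in your referers.
SITE_SEARCH_PARAMS = ("sitesearch", "domains")


def is_site_search_referrer(referrer):
    """True if a Google referer looks like a search restricted to our own site."""
    parts = urlsplit(referrer)
    host = parts.hostname or ""
    if "google." not in host:
        return False
    params = parse_qs(parts.query)
    scoped_values = []
    for name in SITE_SEARCH_PARAMS:
        scoped_values.extend(params.get(name, []))
    return any(OWN_DOMAIN in value for value in scoped_values)
```

Hits for which this returns True would be left out of the search engine report, since those visitors were already on the site.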

    Iguana

    10:52 am on Sep 23, 2002 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    how about

    .css files
    .js files
    favicon.ico (although this deserves a special report in itself!)

    My filter just includes only .htm files and then filters out various spiders from there

    Sinner_G

    11:55 am on Sep 23, 2002 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    robots.txt

    sophtware

    1:41 pm on Sep 24, 2002 (gmt 0)

    10+ Year Member



    "...Sometimes I simply filter out all hits with no referrer, as that will filter out most agents and spiders)"

    Note: This will also filter out anyone who starts their web browser and enters the URL directly. So if you do any offline advertising (radio, TV, magazines, etc.), you will lose those visits as well.

    Mikkel Svendsen

    6:55 pm on Sep 24, 2002 (gmt 0)

    10+ Year Member



    yes, and bookmarks :)

    ppg

    11:53 am on Sep 25, 2002 (gmt 0)

    10+ Year Member



    automated site grabbers (HTTrack recently badly skewed my stats for the day it visited)

    incywincy

    11:55 am on Sep 25, 2002 (gmt 0)

    10+ Year Member



    hits due to publishing?

    Mikkel Svendsen

    2:08 pm on Sep 28, 2002 (gmt 0)

    10+ Year Member



    incywincy, I am not sure what you mean by "hits due to publishing"? :)

    danny

    5:42 am on Sep 29, 2002 (gmt 0)

    WebmasterWorld Senior Member 10+ Year Member



    Here's my Analog agent exclude list:

    BROWEXCLUDE VayalaCreep*
    BROWEXCLUDE DoCoMo*
    BROWEXCLUDE MyBrowser*
    BROWEXCLUDE WebStripper*
    BROWEXCLUDE *Muscat*
    BROWEXCLUDE "Mozilla/3.0 (compatible)"
    BROWEXCLUDE *Zeal*
    BROWEXCLUDE divine
    BROWEXCLUDE *Zealbot*
    BROWEXCLUDE MFC*
    BROWEXCLUDE *WEBsaver*
    BROWEXCLUDE Mozzilla*
    BROWEXCLUDE Zao*
    BROWEXCLUDE pavuk*
    BROWEXCLUDE Rumours-Agent
    BROWEXCLUDE OmniWeb
    BROWEXCLUDE Verity*
    BROWEXCLUDE User-agent:
    BROWEXCLUDE divine*
    BROWEXCLUDE Lachesis
    BROWEXCLUDE SuperGet*
    BROWEXCLUDE *Check*
    BROWEXCLUDE DSPAFCKS
    BROWEXCLUDE dCSbot*
    BROWEXCLUDE LLUPDA*
    BROWEXCLUDE *eradex*
    BROWEXCLUDE EmailWolf*
    BROWEXCLUDE Gigabot*
    BROWEXCLUDE *Libro*
    BROWEXCLUDE SlySearch*
    BROWEXCLUDE Terminator
    BROWEXCLUDE DiaGem*
    BROWEXCLUDE ASPseek*
    BROWEXCLUDE GetRight*
    BROWEXCLUDE LLUPDATECTRL
    BROWEXCLUDE DBrowse*
    BROWEXCLUDE RPT*
    BROWEXCLUDE ADB*
    BROWEXCLUDE Pita*
    BROWEXCLUDE Eye*
    BROWEXCLUDE Swish*
    BROWEXCLUDE Scoot*
    BROWEXCLUDE indexer*
    BROWEXCLUDE PSurf*
    BROWEXCLUDE *Whizbang*
    BROWEXCLUDE lwp*
    BROWEXCLUDE Pompos*
    BROWEXCLUDE *Ctrl*
    BROWEXCLUDE *.NET*CLR*1.0.2914)
    BROWEXCLUDE *hhjhj@yahoo.com
    BROWEXCLUDE combine*
    BROWEXCLUDE *takoy
    BROWEXCLUDE ASPSeek*
    BROWEXCLUDE *NEWT*
    BROWEXCLUDE *API*
    BROWEXCLUDE webrank*
    BROWEXCLUDE target*
    BROWEXCLUDE webfetch*
    BROWEXCLUDE HBWSTUINYJC
    BROWEXCLUDE SIE-*
    BROWEXCLUDE HttpAuth*
    BROWEXCLUDE OFCGASG
    BROWEXCLUDE *ELNSB50*
    BROWEXCLUDE sitecheck*
    BROWEXCLUDE www.webwombat.com.au
    BROWEXCLUDE Benjamin-Bandicoot*
    BROWEXCLUDE nabot*
    BROWEXCLUDE knowledge*
    BROWEXCLUDE *SYMPA*
    BROWEXCLUDE gigabaz*
    BROWEXCLUDE flunky
    BROWEXCLUDE testing*
    BROWEXCLUDE Bublos*
    BROWEXCLUDE LWP*
    BROWEXCLUDE Robo*
    BROWEXCLUDE ASSORT*
    BROWEXCLUDE Jack
    BROWEXCLUDE *hverify*
    BROWEXCLUDE ImageCollector
    BROWEXCLUDE *HTTrack*
    BROWEXCLUDE *nfoseek*
    BROWEXCLUDE Webdup*
    BROWEXCLUDE HBLFMN
    BROWEXCLUDE bumblebee*
    BROWEXCLUDE curl*
    BROWEXCLUDE iQuest*
    BROWEXCLUDE ScoutAbout
    BROWEXCLUDE webf_bgb
    BROWEXCLUDE *....../1.0*
    BROWEXCLUDE Mozilla/4.01
    BROWEXCLUDE WebReaper*
    BROWEXCLUDE WhereEverythingIs.com
    BROWEXCLUDE StressTest
    BROWEXCLUDE *Engine*
    BROWEXCLUDE RepoMonkey*
    BROWEXCLUDE MultiText*
    BROWEXCLUDE Harvest*
    BROWEXCLUDE moget*
    BROWEXCLUDE *polybot*
    BROWEXCLUDE *gozilla*
    BROWEXCLUDE *NetscapeOnline.co.uk*
    BROWEXCLUDE *Magnet*
    BROWEXCLUDE *Indy*
    BROWEXCLUDE *QXW*
    BROWEXCLUDE ParaSITE*
    BROWEXCLUDE Cartographer*
    BROWEXCLUDE WFARC
    BROWEXCLUDE *www.webtop.com*
    BROWEXCLUDE womderer
    BROWEXCLUDE WebCopier*
    BROWEXCLUDE teoma*
    BROWEXCLUDE antibot*
    BROWEXCLUDE MIIxpc*
    BROWEXCLUDE canifindthis*
    BROWEXCLUDE webcollage*
    BROWEXCLUDE CGLConnection
    BROWEXCLUDE EbiNess*
    BROWEXCLUDE Bjaaland*
    BROWEXCLUDE DittoSpyder
    BROWEXCLUDE webbandit*
    BROWEXCLUDE *Informant*
    BROWEXCLUDE *BunnySlippers*
    BROWEXCLUDE *PBIE41298*
    BROWEXCLUDE TITAN*
    BROWEXCLUDE *experiment*
    BROWEXCLUDE WebCopier
    BROWEXCLUDE sprocket*
    BROWEXCLUDE WebCraft*
    BROWEXCLUDE SAKHR*
    BROWEXCLUDE VCI*
    BROWEXCLUDE eCatch*
    BROWEXCLUDE gazz*
    BROWEXCLUDE *davesengine*
    BROWEXCLUDE *AvantGo*
    BROWEXCLUDE fetch*
    BROWEXCLUDE *sureseeker*
    BROWEXCLUDE tv*
    BROWEXCLUDE TE*
    BROWEXCLUDE *petersnews*
    BROWEXCLUDE roach*
    BROWEXCLUDE DigOut4U
    BROWEXCLUDE cosmos*
    BROWEXCLUDE *ip3000.com
    BROWEXCLUDE *Webinator*
    BROWEXCLUDE Spinne*
    BROWEXCLUDE Internet-Html*
    BROWEXCLUDE *Jeeves*
    BROWEXCLUDE NG*
    BROWEXCLUDE *Hotbar*
    BROWEXCLUDE Mozilla/3.Mozilla/2.01*
    BROWEXCLUDE *ZyBorg*
    BROWEXCLUDE *htdig*
    BROWEXCLUDE TURBOEXPLORER
    BROWEXCLUDE Offline*
    BROWEXCLUDE *libwww*
    BROWEXCLUDE Java*
    BROWEXCLUDE asterias*
    BROWEXCLUDE UtilMind*
    BROWEXCLUDE *Link*
    BROWEXCLUDE WebGather*
    BROWEXCLUDE Aranha*
    BROWEXCLUDE appie*
    BROWEXCLUDE *Getweb*
    BROWEXCLUDE NewsTicker*
    BROWEXCLUDE *Control*
    BROWEXCLUDE WebFountain*
    BROWEXCLUDE IL-xml-harvester*
    BROWEXCLUDE Crescent*
    BROWEXCLUDE *Sidewinder*
    BROWEXCLUDE *larbin*
    BROWEXCLUDE Hubater*
    BROWEXCLUDE DIIbot*
    BROWEXCLUDE Teleport*
    BROWEXCLUDE *Grabber*
    BROWEXCLUDE *Katriona*
    BROWEXCLUDE ok
    BROWEXCLUDE *LEIA*
    BROWEXCLUDE *ooglebot*
    BROWEXCLUDE *Mercator*
    BROWEXCLUDE *NetCarta_WebMapper*
    BROWEXCLUDE *Wget*
    BROWEXCLUDE *Arachnia*
    BROWEXCLUDE *Phantom*
    BROWEXCLUDE *www.WebWombat.com.au*
    BROWEXCLUDE *MuscatFerret*
    BROWEXCLUDE *WebCapture*
    BROWEXCLUDE *Ultraseek*
    BROWEXCLUDE *grabClient*
    BROWEXCLUDE *ia_archiver*
    BROWEXCLUDE *Slurp*
    BROWEXCLUDE *cooter*
    BROWEXCLUDE *rawl*
    BROWEXCLUDE *pider*
    BROWEXCLUDE *obot*
    BROWEXCLUDE *Bot*
    BROWEXCLUDE *earch*
    BROWEXCLUDE *xyro*
    BROWEXCLUDE *ExtractorPro*
    BROWEXCLUDE *Gulliver*
    BROWEXCLUDE *EmailSiphon*
    BROWEXCLUDE Katriona
    BROWEXCLUDE *Explorer/0.1*
    BROWEXCLUDE *hitwise*
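
For anyone who wants to apply a list like this outside Analog, here is a minimal Python sketch. It assumes that `*` is the only wildcard, that a pattern must match the whole user agent string, and that matching is case-sensitive (entries such as `*ooglebot*` and `*pider*` suggest that). Analog's real matching rules may differ, so treat this as an approximation. The sample config uses a few lines from the list above.

```python
import re


def pattern_to_regex(pattern):
    """Turn a '*' wildcard pattern into a compiled regex (case-sensitive)."""
    literal_parts = (re.escape(part) for part in pattern.split("*"))
    return re.compile(".*".join(literal_parts), re.DOTALL)


def load_patterns(config_text):
    """Extract the pattern from each 'BROWEXCLUDE <pattern>' line."""
    patterns = []
    for line in config_text.splitlines():
        fields = line.strip().split(None, 1)
        if len(fields) == 2 and fields[0] == "BROWEXCLUDE":
            patterns.append(fields[1].strip('"'))
    return patterns


def is_excluded(user_agent, regexes):
    """True if the whole user agent string matches any exclude pattern."""
    return any(regex.fullmatch(user_agent) for regex in regexes)


SAMPLE_CONFIG = """
BROWEXCLUDE *HTTrack*
BROWEXCLUDE *ooglebot*
BROWEXCLUDE "Mozilla/3.0 (compatible)"
BROWEXCLUDE Java*
"""

REGEXES = [pattern_to_regex(p) for p in load_patterns(SAMPLE_CONFIG)]
```

With these four patterns, a `Googlebot/2.1 ...` agent or a bare `Mozilla/3.0 (compatible)` is excluded, while an ordinary `Mozilla/4.0 (compatible; MSIE 6.0; Windows NT 5.0)` browser string is kept.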