

Filtering Internet noise: counting only human visitors

I want to improve my own web logging tools

         

vite_rts

10:20 am on Aug 19, 2006 (gmt 0)

10+ Year Member



Hi Guys

I have been experimenting with logging IPs and referers because I am never entirely confident that I understand the figures I get from my two analytics packages, even though both are provided by well-known internet companies.

The issue of how the two treat visitors with no referer is a particular sore point:

One shows that something like 34% of all visitors every day have no referer;

the other seems to handle visitors with no referer very differently.

I saw a third package that had a spider filter, but I stopped using it when the trial period expired because I didn't hear anyone else talking about it.

Anyway, my QUESTION:

Is anyone offering a reliable list of known spiders and robots, either free or cheaply, or is there a well-known, reliable way of building such a list?

What does "UA" mean?

I don't want to ban spiders/robots; I just want to filter them from my figures, so I know what's what.

Thanks

Vite_RTS

volatilegx

10:41 pm on Aug 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Hi vite_rts,

Is anyone offering a reliable list of known spiders and robots, either free or cheaply, or is there a well-known, reliable way of building such a list?

[joseluis.pellicer.org...]
[psychedelix.com...]
[botspot.com...]
[clearwaterbeachcam.com...]
[jafsoft.com...]
[projecthoneypot.org...]
[iplists.com...]
[browsers.garykeith.com...]

What does "UA" mean?

UA is an acronym for "User-Agent", the string that identifies the software a visitor is using to access a web site. From Wikipedia:

When Internet users visit a web site, a text string is generally sent to identify the user agent to the server. This forms part of the HTTP request, prefixed with User-agent: or User-Agent: and typically includes information such as the application name, version, host operating system, and language. Bots, such as web crawlers, often also include a URL and/or e-mail address so that the webmaster can contact the operator of the bot.
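
For example (the exact tokens vary by browser and build, so treat these as illustrative only), a typical browser of that era sends something like:

    Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.8.0.6) Gecko/20060728 Firefox/1.5.0.6

while a well-behaved crawler names itself and gives a contact address, e.g. Googlebot's:

    Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html)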

ronburk

10:54 pm on Aug 19, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't want to ban spiders/robots; I just want to filter them from my figures, so I know what's what.

A never-ending problem. Many bots don't identify themselves. For a website that gets modest traffic, bots can easily represent a non-trivial percentage of all HTTP GETs.

One pressure point available is that most of the bots that don't identify themselves also don't spread themselves across either time or IP address space very well. Thus, simply filtering out all traffic from any IP address that fetched more than N unique URLs during the same day (where N is configurable; 25 works pretty well for me at my website's current size) tends to work pretty well for cleaning up the bottom-feeders, in my experience.
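
A minimal sketch of that kind of filter, assuming Apache/NCSA combined-format access logs and treating the threshold of 25 unique URLs per IP per day as a tunable value rather than anything ronburk prescribed:

    import re
    from collections import defaultdict

    # Matches the IP, the date part of the timestamp, and the requested URL
    # from a combined-format log line, e.g.
    # 1.2.3.4 - - [19/Aug/2006:10:20:00 +0000] "GET /page.html HTTP/1.1" 200 ...
    LOG_LINE = re.compile(r'(?P<ip>\S+) \S+ \S+ \[(?P<day>[^:]+):[^\]]*\] "\S+ (?P<url>\S+)')

    THRESHOLD = 25  # unique URLs per IP per day; retune for your own traffic

    def suspect_ips(path, threshold=THRESHOLD):
        urls_seen = defaultdict(set)           # (day, ip) -> set of distinct URLs
        with open(path) as log:
            for line in log:
                m = LOG_LINE.match(line)
                if m:
                    urls_seen[(m.group('day'), m.group('ip'))].add(m.group('url'))
        return {key for key, urls in urls_seen.items() if len(urls) > threshold}

    for day, ip in sorted(suspect_ips('access.log')):
        print(day, ip)        # candidates to exclude from the "human" figures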

vite_rts

11:08 pm on Aug 19, 2006 (gmt 0)

10+ Year Member



Thanks for the list, volatilegx.

Ronburk, do you mean any IP that visits 25 web pages within your website in the same day?

I'll look at that, thanks.

incrediBILL

8:36 pm on Aug 20, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



You can never tell for sure in post-analysis.

Real-time challenges are the only way to truly identify a bot vs. a human: the human responds to the challenge, while the bot just keeps asking for pages.

Trust me on this: I got too many false positives from log file analysis alone, because even SPEED isn't an indicator anymore thanks to Firefox pre-fetch, Google Web Accelerator, and blog readers that do things like "open all in tabs" and VOILA! 20 blog pages are opened in 10 seconds, yet it's not a bot.

[edited by: incrediBILL at 8:38 pm (utc) on Aug. 20, 2006]

ronburk

10:31 pm on Aug 21, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Ronburk, do you mean any IP that visits 25 web pages within your website in the same day?

Yup. There is no magic in "25"; that's just the current number, which I retune as needed (it used to be smaller when I had even less traffic than today).

IncrediBill's rebuttal is true in the general case, but I don't have to solve the general case -- I only have to make something work for a website for which I have intimate knowledge of the normal traffic patterns. I don't have high enough traffic for shared IP addresses (e.g., AOL users) to be a problem. I don't have blogs. My visitor path length is invariably short (>95% are less than 5 GETs of non-image URLs). Nearly all my traffic is free SE traffic or referrals from technical postings in forums. It's very easy to identify virtually all bots with trivial post analysis for my particular website.

In general, this is one of the (few) blessings of not having a high-traffic site. Log analysis assumptions that would break down seriously if you were getting 100,000 unique visitors per day can be highly accurate on a low-traffic website. When you're more in the range of 1,000 unique visitors per day or less, for example, the incorrect assumption that IP address == person will typically be very nearly correct (unless there is something weirdly skewed about your traffic sources).

As always with traffic analysis, you have to know exactly what assumptions your analysis relies on, and be alert for cases where those assumptions are violated. Of course, it's pretty easy to examine the logs after being alerted to a previously unknown IP address that hit 30 URLs in a day and determine (by looking at the visitor behavior) that it really was something like a school assignment where all the students were behind the same NAT server.

onlineleben

2:50 pm on Aug 23, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I don't want to ban spiders/robots; I just want to filter them from my figures, so I know what's what.

For log analysis I use the free program analog. It is highly configurable via its text-based config file.
To overcome the problem that vite_rts raised, I checked the bot/spider names in my logs (from the UA part) and excluded them from analysis. This produced a more or less good analysis of real users.
To track robot/spider behaviour on my sites, I have the config file exclude all the browser UAs and include only the bots.
For daily analysis I use the real-visitors version, and once per month I run the bot version.
Maybe you can try something like this with your current log analysis program.
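
The same idea can be reproduced outside analog. A rough Python sketch, assuming combined-format logs and a purely illustrative list of bot-name substrings (build your own list from what actually appears in your UA field):

    # Split an access log into "human" and "bot" lines by the User-Agent field,
    # which is the last quoted field in combined log format.
    BOT_NAMES = ('googlebot', 'slurp', 'msnbot', 'crawler', 'spider', 'bot/')

    def is_bot(line):
        parts = line.rsplit('"', 2)
        if len(parts) < 3:          # malformed line without a quoted UA field
            return False
        ua = parts[-2].lower()
        return any(name in ua for name in BOT_NAMES)

    with open('access.log') as log, \
         open('humans.log', 'w') as humans, \
         open('bots.log', 'w') as bots:
        for line in log:
            (bots if is_bot(line) else humans).write(line)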

incrediBILL

7:17 am on Aug 25, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



I checked the bot/spider names in my logs (from the UA part) and excluded them from analysis.

Just curious how you do that when many of them now call themselves IE, Firefox and Opera?

[edited by: incrediBILL at 7:18 am (utc) on Aug. 25, 2006]

onlineleben

11:50 am on Aug 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Just curious how you do that when many of them now call themselves IE, Firefox and Opera?

Right, for these cases I don't have a solution with analog.
One way I try to get decent figures is, when I see a traffic peak, to look directly into the logfile to find out what caused it. Usually it is one of those bots that hide behind harmless user-agent names like the ones mentioned above; I identify them simply by the hundreds of page views from the same IP in a very short time.
I have to admit that this only works for sites with a few hundred visitors a day, as bot behaviour like that just described would not be so visible on high-traffic sites. It is also good to know how the traffic you usually get is distributed over the day:
One of my locally targeted sites shows an almost perfect bell pattern (no activity at night, traffic increasing over the day, with a peak around noon and early afternoon), whereas one of my internationally targeted sites sees more evenly distributed traffic.
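
A rough sketch of that kind of burst check, again assuming combined-format logs; the window and threshold are made-up numbers that you would tune against what a normal day looks like on your own site:

    import re
    from collections import defaultdict
    from datetime import datetime, timedelta

    STAMP = re.compile(r'^(?P<ip>\S+) .*?\[(?P<ts>[^\] ]+)')  # IP and timestamp
    WINDOW = timedelta(minutes=10)
    THRESHOLD = 200   # page views inside one WINDOW that look bot-like for a small site

    hits = defaultdict(list)                   # ip -> list of request timestamps
    with open('access.log') as log:
        for line in log:
            m = STAMP.match(line)
            if m:
                hits[m.group('ip')].append(
                    datetime.strptime(m.group('ts'), '%d/%b/%Y:%H:%M:%S'))

    for ip, times in hits.items():
        times.sort()
        start = 0
        for end, t in enumerate(times):        # sliding window over the sorted times
            while t - times[start] > WINDOW:
                start += 1
            if end - start + 1 > THRESHOLD:
                print(ip, 'made', end - start + 1, 'requests within', WINDOW)
                break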

Stefan

12:58 pm on Aug 28, 2006 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One way of identifying real visitors is to look at the images that are fetched; most spiders take only the HTML and skip the images, so image requests come almost entirely from real browsers. I don't know how helpful this is site-wide, but for particular pages you're interested in, just make sure there is a specific image file associated with the page, and see how many times a day it is fetched.
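
A small sketch of that, assuming combined-format logs and a hypothetical image path that only the page of interest loads:

    import re
    from collections import Counter
    from datetime import datetime

    IMAGE = '/images/article-photo.jpg'   # hypothetical: unique to the page you care about
    LINE = re.compile(r'\[(?P<day>[^:]+):.*?"(?:GET|HEAD) (?P<url>\S+)')

    per_day = Counter()
    with open('access.log') as log:
        for line in log:
            m = LINE.search(line)
            if m and m.group('url') == IMAGE:
                per_day[m.group('day')] += 1

    # Print daily fetch counts in date order, a rough proxy for human page views.
    for day in sorted(per_day, key=lambda d: datetime.strptime(d, '%d/%b/%Y')):
        print(day, per_day[day])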