Right on - the situation today is a total mess.
In one case, I got so desperate to measure real human traffic that I threw out every IP address that had more than 50 page views in a day, then ran the analytics over the revised logs. Only at that point did I start to see patterns that were more actionable, but I know my heavy-handed tactics introduced their own distortions.
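That blunt cutoff is easy to script. Here is a hypothetical Python sketch of the idea (not anything from this thread), assuming combined-format access logs already split one file per day, with the client IP as the first field:

```python
from collections import Counter

def heavy_hitters(log_lines, threshold=50):
    """IPs with more than `threshold` requests in the given lines.

    Assumes combined-format logs (client IP is the first
    whitespace-delimited field), one day of traffic per file.
    """
    counts = Counter(line.split(" ", 1)[0] for line in log_lines if line.strip())
    return {ip for ip, n in counts.items() if n > threshold}

def strip_heavy_hitters(log_lines, threshold=50):
    """Drop every line from an IP that exceeded the daily threshold."""
    lines = list(log_lines)
    noisy = heavy_hitters(lines, threshold)
    return [line for line in lines if line.split(" ", 1)[0] not in noisy]
```

The obvious distortion: a busy office behind one NAT gateway gets thrown out along with the bots, which is exactly the kind of side effect described above.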
I only use server logs to see which bots are trying to hit a site, which search engines are visiting, and how much content they take.
I think that depends on the type of site you have and the type of visitors you attract.
I've been using Google Analytics and added the Quantcast script to my pages last weekend. It will be interesting to see how the UV and PV numbers compare after a month or so.
Google Analytics vs. Quantcast:
Visits: 34,644 vs. 35,231
Absolute Unique Visitors: 30,165 vs. 29,360
Pageviews: 165,603 vs. 160,728
Average Pageviews: 4.78 vs. 4.56
What is weird is that GA gives me credit for more page views but Quantcast gives me a higher visitor count. I never noticed this before because I only check one stat on a daily basis - site revenue.
When have web stats ever been accurate? Sadly this is nothing new. Trying to discern humans from bots has been a problem from the start of web analytics.
Thanks for bringing the latest difficulty to light. This is just another affirmation that letting go of absolute numbers several years back was the right decision.
Then run and hide when the client starts screaming about his very low conversion rates.
|Trying to discern humans from bots has been a problem from the start of web analytics |
Not the bots that play by the rules and properly identify themselves like the major search engines.
Humans used to be the dominant species on the web, as opposed to a few meandering bots, but that has completely changed: some small web sites are now dominated by more bots than humans.
It's the proliferation of bots that fly under the radar, an ever-escalating trend even among the corporate variety, that's quite troublesome.
Week after week the posts keep coming from new webmasters wondering why their stats claim they have all these visitors yet they have no sales, affiliate, or AdSense revenue.
Someone has to break the bad news to them that those aren't humans on their site.
what of awstats - does it make a somewhat decent attempt at sorting the humans from the chaff?
That had been my impression.
I have worked on products like Google Analytics in the past. You are damn correct: it can be close, but no cigar.
I have tried Sawmill, AWStats, and a lot more looking for a better log analyzer, but there was always a big gap.
My only suggestion: if you are not using your log files, stop writing them to save server resources. What I don't like about analytics packages, and what I liked about raw log files, is the raw data. Once you have raw data you can play a lot with it, like building a click path for each IP and then an average path for certain keywords, etc.
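A minimal sketch of that kind of raw-data play, building a click path per IP, might look like this. This is a hypothetical example assuming Apache/nginx combined log format; a real version would also need sessionization by time gaps before averaging paths:

```python
import re
from collections import defaultdict

# Minimal combined-log prefix: IP ident user [time] "METHOD /path ..."
LINE_RE = re.compile(r'^(\S+) \S+ \S+ \[[^\]]+\] "(?:GET|POST|HEAD) (\S+)')

def click_paths(log_lines):
    """Map each client IP to the ordered list of paths it requested."""
    paths = defaultdict(list)
    for line in log_lines:
        m = LINE_RE.match(line)
        if m:
            ip, path = m.groups()
            paths[ip].append(path)
    return dict(paths)
```

Grouping by IP is crude (proxies like AOL rotate addresses, as noted later in this thread), but it shows why the raw lines are worth keeping: an analytics front end rarely lets you re-slice at this level.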
Excellent topic incrediBILL and it couldn't have come at a better time.
Great advice too! GA does a fairly decent job for small to medium sites. Add a bit of log file analysis, maybe a custom application to track specific logfile activity, and you are one step ahead of the Joneses.
I'm looking forward to more discussion on this as I'm getting ready to Ban 75% of the Planet on 2009-01-01 for one site as an experiment for the year. You'll have to knock and wait for someone to answer the door if you are not on the Allowed List. ;)
By the way, I think 2009 will be an eye opener for many, at which time these topics will be Front Page every day. ;)
|what of awstats - does it make a somewhat decent attempt at sorting the humans from the chaff? |
I hope you have a locked down version. < Is there such a thing?
robots.pm (the robots list for AWStats) is just a big long list of known bots, and they seem to have some mechanisms to detect some additional details, but I'm not so sure it would be very good at detecting the types I'm talking about here, the ones that don't want to be detected in the first place.
None of the bots in their list would ever skew the stats of humans vs. bots.
If you merely sort out everything that isn't MSIE/FF/Opera you have a good start, and you can do that with a whitelist rather than that big list of bots. Their list does provide links to the bot owners' sites, though, which is nice.
However, the bots I'm talking about always claim to be MSIE/FF/OPERA so you have to have ranges of hosting data centers and such to filter out all of the other automated noise.
Once you've done that, then you have to filter out rogue activity which isn't always so obvious, things that aren't even stored in the log files.
For instance, a scraper using AOL will hop from IP to IP on a timer and the only way to really tell it's the same scraper is if that scraper is accepting your cookie which many do these days.
You don't find cookie data in log files.
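At least, not by default. Apache's mod_log_config can append a named request cookie to each line via the `%{name}C` directive; a hypothetical format assuming a session cookie called `sid` might be:

```apache
# Combined format plus the incoming "sid" cookie as a final field
LogFormat "%h %l %u %t \"%r\" %>s %b \"%{Referer}i\" \"%{User-Agent}i\" \"%{sid}C\"" combined_sid
CustomLog logs/access_log combined_sid
```

With that in place, grouping lines by the cookie field instead of the IP would expose one AOL scraper hopping across many addresses, at the cost of bigger logs and the processing-time problem mentioned below.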
I could go on and on, but there's quite a bit of information that you don't see in a post-mortem analysis simply because the data retention would be astronomical and so would the time to process it all.
THE SKY IS FALLING THE SKY IS FALLING...
So what's the solution?
Take a proactive approach and do as much as you can to lock things down. Let's talk more about banning specific IP ranges that we know to be the major abusers of our local networks. I'm almost certain that all of us here in the US are getting pounded by the same damn bots and it sure would be nice to just press a button and be done with them. I know, I know, some of you already do that. Hey, I want about 10 of those buttons, where can I get one?
I pretty much gauge my traffic on Google's Adsense impressions lately. They are pretty consistent. Even when my visitors are bouncing all over the board due to bots, my Adsense stats remain my one constant North star... ;-)
|For instance, a scraper using AOL will hop from IP to IP on a timer and the only way to really tell it's the same scraper is if that scraper is accepting your cookie which many do these days. |
Sort of true, IMO. The scraper (app) could accept the cookie and immediately delete it automatically. Then on the next page visit the webmaster would not know for sure whether it is the same person/scraper or a "regular" AOL user.
I'm using a server log analyzer from a Ukrainian software maker; this does a very good job at filtering out most illegitimate traffic, as well as hotlinkers. It's still higher than the Adsense count (as Adsense is not present on all the pages), but pretty much in line and consistent.
I agree on the notion that the rawness of the raw data is what makes log files so attractive to analyze.
I gave you my best solution, Google Analytics.
Somewhere between GA and AWSTATS lies the truth.
|scraper (app) could accept the cookie and immediately delete it |
Many scrapers tend to keep the cookie so sites like WebmasterWorld that require cookies won't punt them to the curb.
|I agree on the notion that the rawness of the raw data is what makes log files so attractive to analyze. |
Ain't that the truth!? I always thought of raw logfiles as stuff that the programmers looked at every now and then. Or my Server Admins were screaming at me because they were too damn large. That all changed a while back. I scour those puppies now. I have an app to do it for me and it does it in real time. Keeps us on our toes too.
GA has actually helped some of my smaller clients to uncover some unusual stuff. We typically don't watch things on a daily basis like many do. Nah, that is boring, it really is. Once a site reaches a certain level, you tend to look more at overall trends on a weekly and/or monthly basis and then dig for the pearls when time permits. Small budgets, minimal time; it has to be spread out elsewhere. Mining analytics is a tedious task and requires a certain skill that I have yet to really grab hold of. I'm learning, though.
For example, I have one client who has a publication. It just happens to utilize a three letter word that is very popular in another industry. Someone ran a little experiment against their domain and hotlinked to the publication with related terms and they were sure to use a few of the exact terms from the publication title. Want to see your traffic go nuts for a little while? That term ranked number one for an entire month. It actually increased sales of the pub too just by default, go figure. Either way, in looking under the microscope at the effects, they were not good and not worth the added revenue, that one was blocked permanently. Within 30 days or so, things were back in order. I watched that term vanish from the list.
Out of all the analytics packages out there, GA can't be beat. You pretty much sign your soul over to the devil and in return you get something that many pay a pretty penny for. Now, Omniture stats are a sight to behold. Especially when there are high volume numbers. Try analyzing the logfiles of a site that large. Anyone here do that? I mean, are you micro-managing more than a million pages?
|Sadly, there is no accurate analytics solution at this time. |
I wasn't real happy to read that. Are we absolutely sure of that? With all this technology we don't have a solution that is "almost" accurate? ;)
Almost accurate isn't accurate.
I'm looking at fewer than 1,000 pages monthly getting hit more than a million times a month. Log files are important... as well as filtering out all the bots and scrapers... and it looks like I get about 200 uniques a month. Which is okay; it means more than I thought, as they are NEWBIES to the site, which is a very narrow niche.
But I would like to narrow that metric down to REAL USERS if I could.
And I can't. Not yet.
Oh, no. The sky is not falling.
If the internet was infested with bot traffic as the OP claims, then there would be no sites with 5 hits per day. :-)
|If the internet was infested with bot traffic as the OP claims, then there would be no sites with 5 hits per day. :-) |
Even bots have to have a way to find a site before spidering it. "Build it and they will come" doesn't even work for scrapers, I'm afraid ;)
|Somewhere between GA and AWSTATS lies the truth. |
bingo. my fave combination.
there's one major upside to server logs though:
reporting all requests and traffic for all files
... for you can't put urchin code into an image
And won't notice access to areas you didn't use js tracking on.
hotlinking / download trends, media files, security issues... unless you serve every single request through jump pages, you won't know what takes up MOST of your data traffic for some sites. And even then you might not notice accesses or break-in attempts to sections you thought to be safe.
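The logs can answer that "what takes up MOST of your traffic" question directly. A hedged sketch, assuming combined-format lines where response bytes follow the status code:

```python
import re
from collections import Counter

# Request line, status, and byte count from a combined-format entry
BYTES_RE = re.compile(r'"(?:GET|POST|HEAD) (\S+)[^"]*" \d{3} (\d+)')

def bytes_by_extension(log_lines):
    """Sum response bytes per file extension -- images, media, pages."""
    totals = Counter()
    for line in log_lines:
        m = BYTES_RE.search(line)
        if m:
            path, nbytes = m.groups()
            path = path.split("?", 1)[0]
            ext = path.rsplit(".", 1)[-1].lower() if "." in path else "(none)"
            totals[ext] += int(nbytes)
    return totals
```

A page-tag tracker never sees any of this, which is the whole point being made here: images, media files, and hotlinked assets simply don't execute JavaScript.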
So I'd rather say that your agent log is complete fiction...
...traffic reports are not.
ugh... could they be any more real
|Almost accurate isn't accurate. |
"Good enough" is good enough.
Folks, bots and humans are going to visit you; just look for the spikes in traffic to chase down robots. Work on sorting out or blocking the robots, but don't spend all of your days on it. Focus on attracting humans; for most of us the money is in humans, not in blocking robots.
[edited by: Edge at 12:18 pm (utc) on July 16, 2008]
How many of these rogue bots and scrapers are grabbing the images from the page? Would a 1x1 transparent GIF on the page give you a reasonable count of real "eyeballs"?
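Counting pixel requests from the logs would be straightforward. A sketch of that approach, where `/p.gif` is a made-up name for the 1x1 beacon, on the assumption (imperfect, given the scrapers discussed above that run full browser engines) that most rogue bots skip images:

```python
def pixel_uniques(log_lines, pixel_path="/p.gif"):
    """Count distinct IPs that fetched the tracking pixel.

    Assumes combined-format lines with the client IP first; the
    pixel path is a hypothetical placeholder.
    """
    ips = set()
    for line in log_lines:
        if ("GET " + pixel_path) in line:
            ips.add(line.split(" ", 1)[0])
    return len(ips)
```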
1) K-Meleon with JS and Java disabled
2) Firefox with prefetch disabled. JS is enabled, but still there's no Flash installed. Some sites that I visit often are bookmarked only in Firefox, because I already know and trust those sites, and know that JS is required for some functions on those sites.
3) Explorer with JS and Flash.
I do my daily "news" surfing with K-Meleon. If I come across something that needs JS and I have to see it, I paste the URL into Firefox.
If I must see some Youtube or news video (maybe a couple times a week) I reload the page in Explorer.
I don't update my browsers; my versions are fairly old. I've never used anti-virus, and never had any infection problems on my Windows XP. Once a week I religiously back up recent files using xcopy in a command window, and copy them to another computer or flash stick. I also export the registry to a backup file, and also set a new restore point on the XP. All cookies on all browsers are killed automatically several times a day, with a routine that was added to a different program that I have to use several times a day anyhow.
[edited by: ralent at 2:23 pm (utc) on July 16, 2008]