Why Your Server Log Files are Complete Fiction
War on Web Sites Claims Analytics Casualty
incrediBILL
msg:3699290 - 10:17 pm on Jul 15, 2008 (gmt 0)

If your online business relies on analyzing your web server logs, you're already in trouble.

Many online services pretend to be human visitors by masquerading as browsers, so they can access your server unnoticed and unblocked; as a side effect, they get added to the visitor counts in your server log analytics.

Just recently there was a massive outcry about AVG [webmasterworld.com] using this deceptive practice, which has now almost stopped, but many others continue unabated.

Some of the companies skewing your analytics include Picscout [webmasterworld.com], Cyveillance [webmasterworld.com], Munax [webmasterworld.com], WebSense [webmasterworld.com], and many more, too numerous to mention here, a list that grows daily.

It's not only companies: scrapers, spam harvesters, spambots, botnets and all sorts of other illicit operations disguise themselves the same way to avoid being stopped.

To put it mildly, your server log analytics are complete and utter fiction.

That means clients who think these server-side stats are actually meaningful are probably making life miserable for their web designers, SEOs and marketing staff, because the inflated visitor counts make conversion rates look low or declining, as if the team were doing a bad job.

Is there a solution?

Short of writing very complicated software that can detect and filter out all these sources of deceptive activity, switch to javascript-based analytics such as Google Analytics. The downside of javascript analytics is that some people now block javascript for security reasons and ad blockers stop javascript-based tracking, but that's a small percentage. The upside is that most automated tools don't execute javascript, so at least you're not counting many fake hits.
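
To illustrate why that filtering software is harder than it sounds, here is a minimal sketch (assuming a combined-format access log; the file name and user-agent patterns are purely illustrative) that drops only self-identified bots. Anything cloaked as a real browser, which is the whole problem described here, sails straight through it:

#!/usr/bin/perl
# Sketch only: filter self-identified bots out of a combined-format access log.
# Bots that fake a browser user agent pass straight through this filter.
use strict;
use warnings;

my @bot_patterns = (
    qr/googlebot/i, qr/slurp/i, qr/msnbot/i,
    qr/libwww-perl/i, qr/curl/i, qr/wget/i,
);

open my $log, '<', 'access.log' or die "access.log: $!";   # hypothetical file name
while (my $line = <$log>) {
    # combined log format ends with: "referer" "user-agent"
    my ($ua) = $line =~ /"([^"]*)"\s*$/;
    next unless defined $ua;
    next if grep { $ua =~ $_ } @bot_patterns;   # drop the honest, self-declared bots
    print $line;                                # "human" traffic, fakes included
}
close $log;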

So your options are slightly under-counting with javascript based solutions or massively over-counting with raw log file analysis.

Sadly, there is no accurate analytics solution at this time.

You can get close, but no cigar.

The best advice at this time is to use both javascript-based and raw-log-file analytics, knowing that the truth lies somewhere in the middle, and closer to the low count than the high.

[edited by: incrediBILL at 10:22 pm (utc) on July 15, 2008]

 

carguy84
msg:3699859 - 2:18 pm on Jul 16, 2008 (gmt 0)

Urchin (the software) analyzes the raw log files, but it also uses a piece of javascript to track traffic. So you can track all your pageviews, and then set up a different report to see the traffic to your objects (images, CSS files, JS files...).

janharders
msg:3699867 - 2:37 pm on Jul 16, 2008 (gmt 0)

There are IP lists of all major bots. Of course they don't include small, privately run bots, but it's a start. Then you could add more sophisticated bot recognition, for example looking not only at the user agent but also at the HTTP version, Accept-Encoding and so on, which most bot builders are too lazy (or uninformed?) to adjust when faking browser user agents. Then, of course, traffic spiking and speed are an issue. A hiding (and therefore slow) bot and a very busy user might both request a hundred pages on a given day, but the user usually won't do them in a row and finish in 10 minutes, while the bot might.

For bots that are specifically designed to extract your content from a database search, you can check referers and see whether they match, which of course is quite a bit of work in a script (i.e. if the visitor never did a search that includes result 12345, he may be a bot if he requests show.php?id=12345). For AOL users who use the internal browser and therefore sit behind proxies, that might also be a way to tell them apart if you have a large site with a lot of pages. If you combine session timeouts with analysis of plausible clickstreams, you may be able to see that one AOL user went from this page to that page while another is operating in a different area of the site. That, of course, requires your software to know the link structure of your site.
Add cookies, user-agent tracking and so on, and it will get you closer to the truth.
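
A minimal sketch of that kind of header sanity check (the weights, threshold and header set are purely illustrative assumptions): a client that claims to be a modern browser but speaks HTTP/1.0 or omits headers real browsers always send gets scored as a likely bot.

#!/usr/bin/perl
# Sketch only: score a request by how "browser-like" its headers really are.
use strict;
use warnings;

sub bot_score {
    my (%req) = @_;                 # protocol, user_agent, headers (hashref)
    my $score = 0;
    my $claims_browser = ($req{user_agent} || '') =~ /Mozilla/i;

    $score += 2 if $claims_browser && ($req{protocol} || '') eq 'HTTP/1.0';
    $score += 2 if $claims_browser && !$req{headers}{'Accept-Encoding'};
    $score += 1 if !$req{headers}{'Accept-Language'};
    $score += 1 if !$req{headers}{'Accept'};

    return $score;                  # e.g. treat 3 or more as "probably not a browser"
}

# Example: a "Firefox" speaking HTTP/1.0 with no Accept-Encoding or Accept-Language scores 5.
print bot_score(
    protocol   => 'HTTP/1.0',
    user_agent => 'Mozilla/5.0 (Windows; U; Windows NT 5.1) Firefox/3.0',
    headers    => { 'Accept' => '*/*' },
), "\n";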

Back in the day, we tried a dirty little trick using basic auth and the feature that lets you include it in URLs: basically, everyone who arrived got redirected to [uniquesessionid:whatever@mywebsite.tld...] and so sent this "username" and password with every request. Too bad that shortly after we deployed it, browsers changed their default behaviour of silently accepting those URLs, due to security issues.
Another thing we tried was session IDs in subdomains, tracked and generated by a mod_perl handler running as a PerlFixupHandler. That was nice because it allowed us to keep static HTML (site-absolute links like /files/bar.html still worked), so there was no dynamic-content overhead, session tracking didn't break, and it was quite accurate. It was given up when we finally rewrote the whole site and lost the frames. That kind of session tracking was mainly born because we wanted people jumping in from search engines to get a frameset with the right page shown in the content frame, so anyone who had a valid session would also have a frameset; everyone else would get a session and a frameset.

poppyrich
msg:3700006 - 4:52 pm on Jul 16, 2008 (gmt 0)

@incrediBill

The average with javascript disabled is below 5%, probably closer to 1-2% for most sites.

Are you talking about real users, crawlers, both, or what?

And how did you measure this?

I have a deep background in user support, including at big companies, and outside of companies or departments with high-security needs I've never seen javascript turned off. I've asked others with similar backgrounds whether they've ever seen it turned off, and their experience is the same as mine.

In the corporate world, the average intranet would simply not work properly if script was disabled in the user agents.

And nobody is going to get me to believe that the percentage of Windows users using IE with script turned off is more than .00001% or some such negligible number.

wheel
msg:3700151 - 7:37 pm on Jul 16, 2008 (gmt 0)

This hasn't been a problem for me in the past. But I just reviewed the logs for a couple of my larger sites and it's unbelievable how much crap is scraping my content. One site alone gave up 50 gigs of bandwidth in one month, just counting the top couple dozen visitors who are scraping the site.

Sigh. I guess it's time to consider doing something to stop all this. What are folks doing at the server level to stop it? Is there a method that works reasonably well that's not too crazy to implement?

ergophobe
msg:3700209 - 8:53 pm on Jul 16, 2008 (gmt 0)

How about a honeypot approach?

Obviously, it wouldn't work for the AVG issue or anything else like that which isn't crawling your links, but what about the other ones?

janharders
msg:3700229 - 9:08 pm on Jul 16, 2008 (gmt 0)

That's right, ergophobe. It's pretty much what Bot-Trap uses: a hidden link to a directory that is disallowed for every bot in robots.txt. A user won't click it; a bot that does not obey robots.txt might. A hit leads to some kind of analysis, possibly a captcha to prove you're human and really did land there by accident, and a ban if you fail to prove it.
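
For anyone who hasn't seen one, the whole setup fits in a few lines (the paths and file names here are made up for illustration): the trap directory is disallowed in robots.txt, and the link to it is hidden so no human ever clicks it.

# robots.txt: any well-behaved crawler stays out of the trap directory
User-agent: *
Disallow: /trap/

<!-- hidden link buried in a normal page; only robots.txt-ignoring bots follow it -->
<a href="/trap/"><img src="/img/clear.gif" width="1" height="1" alt="" border="0"></a>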

Bentler
msg:3700267 - 9:43 pm on Jul 16, 2008 (gmt 0)

Is there a way to tie in apache's logfile writes to a dynamically inserted piece of javascript when you have a dedicated server?

That would be a nice way to distinguish people with javascript enabled from people with javascript disabled + bots.

ergophobe
msg:3700311 - 10:44 pm on Jul 16, 2008 (gmt 0)

Sure, I know how a honeypot works, but again, in the case of something like the AVG bot it will do absolutely no good, because AVG was not crawling your site but rather working off Google results (and Google observes robots.txt).

So the question is, do the bots Bill is complaining about crawl (and thus can be caught by a honeypot) or are they accessing the site in more sophisticated ways?

docbird
msg:3700378 - 12:30 am on Jul 17, 2008 (gmt 0)

I've used AlexK's script for bad bots - I added it after bandwidth spikes from rogue bots led my former shared-server host to turn my site off for a couple of short spells. It worked, and was nifty, though it likely didn't catch everything.

Lately I moved to a host that's better overall (notably uptime), but the bot script isn't working there - after a while it can ban me, and maybe all users, perhaps after the backup routine runs on the server.
Time to add IP lists, though I'd read that the badder bots' IPs are in flux, and I wonder if a few of these IPs might ban some potentially worthwhile visitors.

With the move I've also "lost" awstats. It had been on the previous host, in the root; I'm not sure whether it was locked down so the results weren't readily visible to Joe Public.
I haven't (yet) tried installing it on the new host; I'd been looking at AdSense stats, and only a day or two ago signed up with Google Analytics.

frontpage
msg:3700420 - 1:38 am on Jul 17, 2008 (gmt 0)

what % of normal people do you think are blocking javascript? Curious because we have e-commerce sites that rely on it..

Our company routinely blocks javascript, cookies, adsense, etc. via hosts file entries, NoScript and AdBlockPlus as a rule.

You will probably find that Firefox users block cookies and javascript at a higher rate than the Internet Explorer crowd, as they tend to be a little more sophisticated with the 'internets'.

ergophobe
msg:3701015 - 5:54 pm on Jul 17, 2008 (gmt 0)

As more and more sites are AJAX-driven, I'm thinking that fewer and fewer people will have JS disabled. I've lately been surfing with the Firefox NoScript plugin enabled and most sites have some major feature that doesn't work without JS.

I think that's a countervailing trend to the security issue. The problem is that of course it's not random. If you have a security-oriented site, for example, probably a very high percentage of your most important visitors will have JS disabled.

Scarecrow
msg:3701143 - 8:45 pm on Jul 17, 2008 (gmt 0)

Let's say you have a directory that is disallowed in robots.txt. You put a dynamic page in that directory. The link to that page is hidden in some of the real pages that you allow bots to crawl. It looks like a normal link, but it really points to that dynamic page in the disallowed directory, and it's hidden behind a 2x1 (everyone seems to use 1x1) transparent gif on your legit pages. Bad bots that ignore robots.txt will try to follow this link. Anyone or anything that follows it is trapped by the dynamic page. Real users with real eyeballs never click on this link because they cannot see it; you have to be really patient even to get your mouse pointer to recognize there's a link there, because it's only a 2x1 transparent gif.

Your dynamic page logs the rogue IP address, and throws it into htaccess, and then at the end it spits out a Forbidden header.

Now this rogue bot keeps getting a 403 Forbidden from htaccess, but it doesn't care; it just keeps trying all the links it already has for your site. But at least you've now logged the bad guy's IP address, can count its "403" entries in your access_log, and can take other measures later if necessary.
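
A minimal sketch of that dynamic trap page as a plain CGI script. The paths are made up, and it assumes Apache 2.2-style access control with an Order Allow,Deny / Allow from all block already present in the .htaccess it appends to:

#!/usr/bin/perl
# Sketch only: log the rogue IP, append a deny rule to .htaccess, answer 403.
use strict;
use warnings;
use Fcntl qw(:flock);

my $ip = $ENV{REMOTE_ADDR}     || 'unknown';
my $ua = $ENV{HTTP_USER_AGENT} || '-';

# Append the ban (Apache 2.2 syntax; 2.4 would use "Require not ip ..." instead).
open my $ht, '>>', '/var/www/html/.htaccess' or die "htaccess: $!";
flock $ht, LOCK_EX;
print {$ht} "Deny from $ip\n";
close $ht;

# Keep a separate record of who fell into the trap and when.
open my $trap, '>>', '/var/www/logs/trap.log' or die "trap.log: $!";
print {$trap} scalar(localtime) . " $ip $ua\n";
close $trap;

# Tell the bot to go away; it usually won't listen, but now it's banned.
print "Status: 403 Forbidden\r\n";
print "Content-Type: text/html\r\n\r\n";
print "<html><body><h1>403 Forbidden</h1></body></html>\n";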

anallawalla
msg:3701214 - 10:20 pm on Jul 17, 2008 (gmt 0)

Omniture stats are a sight to behold. Especially when there are high volume numbers. Try analyzing the logfiles of a site that large. Anyone here do that? I mean, are you micro-managing more than a million pages?

I resemble the first part of that (Omniture) and the multi-million-pages remark, but not the micro-managing bit. We have a small, dedicated team working on analytics who can answer any of the obscure questions not readily accessible from a menu or dashboard. I think having more pages produces a more reliable estimate (web stats are estimates at best) than a smaller site with a few hundred or thousand pages, which is more likely to be scraped in full.

A colleague and I have created separate dashboards to measure traffic to the same pages (~2M), and our respective counts are consistently out by 10 or 11 pages every day, from the same Omniture data. Hence I treat web stats as estimates.

Receptional Andy
msg:3701223 - 10:34 pm on Jul 17, 2008 (gmt 0)

Just as a data-point. I'm an individual who browses with javascript enabled on a permission-only basis - who buys a lot of things online ;)

AlexK
msg:3702157 - 11:36 pm on Jul 18, 2008 (gmt 0)

wheel:
"it's unbelievable how much crap is scraping my content"
"What are folks doing at the server level to stop this? Is there a method that works reasonably well that's not too crazy to implement?"

docbird:
"I've used AlexK script for bad bots ... Worked, and nifty"
"moved to host that's better overall ... but the bots script not working - after a while, can ban me, and maybe all users, maybe after backup routine run on server."

I've used the bad-bot-blocking script [webmasterworld.com] for about 5 years now. These are my observations:
  1. Its forte is catching fast scrapers.

    They haven't been mentioned so far in this thread. However, since the script regularly blocks bots on the same gigabit network as my server attempting to scrape the site at up to 50 requests/sec, I think they deserve notice.

  2. The script parameters need reviewing every few months.

    The parameter-settings are governed by how busy the site is. More pages/day means turning the dials a bit.

  3. Server-enforced backups that touch the tracking files are a weakness of the script.

    An effective solution is near the end of this thread [webmasterworld.com].

  4. I've abandoned slow-scraper blocking on most sub-domains within my site.

    Slow-scraper blocking puts an upper limit on how many pages any single IP is allowed per day. The problem is that it infallibly catches Google and other 'good' scrapers, which then means maintaining a whitelist. I've become convinced that fast scrapers are the main issue, and therefore I also drop the roll-over (restart) time to 4 hours, which makes the blocking more effective (see the sketch below).
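
A minimal sketch of that style of per-IP counting; the window, limit and 4-hour reset are illustrative assumptions, not the actual parameters of the bad-bot-blocking script:

#!/usr/bin/perl
# Sketch only: flag IPs that request pages at machine speed.
use strict;
use warnings;

my %hits;                      # ip => list of request times (epoch seconds)
my $WINDOW   = 10;             # seconds examined for "fast" behaviour
my $MAX_HITS = 15;             # more than this inside $WINDOW looks scripted
my $RESET    = 4 * 3600;       # forget an IP's history after 4 hours

sub is_fast_scraper {
    my ($ip, $now) = @_;
    $now ||= time;
    my $list = $hits{$ip} ||= [];
    @$list = grep { $now - $_ < $RESET } @$list;     # roll-over / restart
    push @$list, $now;
    my $recent = grep { $now - $_ <= $WINDOW } @$list;
    return $recent > $MAX_HITS;                      # true => serve a 403 instead
}

# Example: 20 hits in the same second trips the check from the 16th request on.
for my $n (1 .. 20) {
    print "request $n blocked\n" if is_fast_scraper('10.0.0.1', 1_000_000);
}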


In summary, and in direct response to this thread, the distortion caused by all the bots is a real pain. AWStats says that 40-50% of all hits on my site are from bots, and those are just the ones it can detect. As just one example, those figures do not include all the bots trying to spam-post into my site's forums...

Even detecting the cloaked bots would take a far more sophisticated algorithm than the bot-blocker script employs. And since Google, the algorithmic superior head-and-shoulders above everyone else, is unable to stop serving advertisements for car-driving instructors against my pages discussing modem drivers, I suspect we may all have a little distance to go yet in that regard.

PS to docbird:
I retain your excellent sticky and am anxious to implement your suggestions. It's just a question of finding the time to do it.

[edited by: AlexK at 11:39 pm (utc) on July 18, 2008]

[edited by: jatar_k at 5:15 pm (utc) on July 31, 2008]
[edit reason] fixed link [/edit]

blend27
msg:3703019 - 6:51 pm on Jul 20, 2008 (gmt 0)

Examining:

IP ranges of known whitelisted bots
Headers
UAs
Whitelisted proxies
IP ranges (country-specific stats from previous visitors vs. the current visitor, AKA trends)
Access speed
JS enabled (bots that request and maybe parse JS, but don't support basic fluid-layout syntax such as self.innerWidth on a div tag)
Random honeypots (forget about robots.txt at this point)
robots.txt access by IP (know who is at your door)
Random redirects (sometimes server-side, sometimes with JS, depending on the browser level), based on the combined score of the above - see the sketch below
Check out this section as often as you can: [webmasterworld.com...]

.. and a few more that I am still examining...
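
A minimal sketch of turning signals like those into one score and a decision; every weight, threshold and signal name below is a hypothetical stand-in, not blend27's actual system:

#!/usr/bin/perl
# Sketch only: combine pre-computed bot signals into allow / challenge / block.
use strict;
use warnings;

sub classify_visitor {
    my (%s) = @_;                                        # signals for one visit

    return 'allow' if $s{whitelisted_bot};               # verified good-bot IP range

    my $score = 0;
    $score += 3 if $s{hit_honeypot};                     # followed a robots.txt-disallowed trap
    $score += 2 if ($s{header_anomalies}    || 0) >= 2;  # browser UA, non-browser headers
    $score += 2 if ($s{requests_per_minute} || 0) > 30;  # machine-fast access
    $score += 1 if !$s{ran_js};                          # never executed the JS check

    return $score >= 5 ? 'block'
         : $score >= 3 ? 'challenge'                     # captcha or server-side redirect
         :               'allow';
}

# Example: honeypot hit plus odd headers and no JS => 'block'.
print classify_visitor(hit_honeypot => 1, header_anomalies => 3, ran_js => 0), "\n";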

A few weeks ago I was pointed to a script that was fed a username and password for a PayPal account. Keeping the session (all cookies), it went 6 pages deep into the site. Made me feel like a rookie at all this...

blend27

Seb7
msg:3703280 - 8:17 am on Jul 21, 2008 (gmt 0)

My Google Analytics says 97.5% of my visitors have java.

Monitoring visitors has never been 100% accurate. I used to prefer logs, since client-side tracking adds more bandwidth to your pages, but today logs just don't carry enough information to make an accurate estimate. I wouldn't recommend that anyone use logs to monitor visits; in fact I've now turned them off completely.

Java does expose a lot of information that can be pulled, which gives a good indication of a real visitor and allows reasonably accurate visitor numbers - I believe Google Analytics does quite a good job of this.

janharders
msg:3703283 - 8:29 am on Jul 21, 2008 (gmt 0)

GA is, in the end, part of Google's data mining. I know some people (the slightly paranoid ones) who block it completely so Google won't know their every move. Guess that counts towards the audience thingy.
Also, I've found that, at least in Germany, quite a few companies run a policy similar to NoScript: no javascript unless you explicitly enable it.

blend27: sounds like a solid approach. Did you write your analyzer yourself, or is there some standard software that can easily be hacked to do it? Logging all the headers can be quite a data overhead, but I guess it's worth it. Care to share the ones you're examining right now?
As for the PayPal thing: it's a different matter to have a bot written, or at least adapted, for a single site. That's actually very easy with the help of some Perl modules like WWW::Mechanize, and somebody who knows his business and really wants to extract data from your site is always hard to stop. You could delay him by making him sign up and then controlling how many pages a user may see per day, but that will only slow him down.

blend27
msg:3703765 - 7:35 pm on Jul 21, 2008 (gmt 0)

janharders,

It's something I started working on right after the Florida update, back in the day. Our host at the time had a stats program that took our rewritten SES URIs and started reporting that the top most-visited destinations were subdirectories, when in reality there are none (at least none visible to users or bots). The system sits on top of a well-normalized relational database, so data overhead is not an issue at all. IIS logs are also parsed into the DB, just for comparison purposes and as a backup for the numbers.

Standard Firefox headers sent via browser:

Cookie: JSESSIONID=60302d6d8d3b2e7c2e5e
Host: example.com
Keep-Alive: 300
Accept-Language: en-us,en;q=0.5
User-Agent: Mozilla/5.0 (Windows; U; Windows NT 5.1; en-US; rv:1.9) Gecko/2008052906 Firefox/3.0
Accept-Encoding: gzip,deflate
Accept-Charset: ISO-8859-1,utf-8;q=0.7,*;q=0.7
Connection: keep-alive
Cache-Control: max-age=0
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
request_method: GET
server_protocol: HTTP/1.1

AVG got caught for missing at least 4 of those, as did lots of rogue bots, as mentioned before.

As for the slightly paranoid ones, I am one of them.
Hosts file entries:

127.0.0.1 *.googlesyndication.com
127.0.0.1 *.doubleclick.net
..............

No Tool bars

And the values of browser.safebrowsing.provider.0.* in Firefox are BLANK as well.

IanTurner
msg:3704609 - 4:44 pm on Jul 22, 2008 (gmt 0)

All the spiders I have are JS-enabled, so there's no way to tell them from a real browser.

Not only that, but they can be programmed to change the javascript they find on a site.

janharders
msg:3704633 - 5:00 pm on Jul 22, 2008 (gmt 0)

If they're JS-enabled and you catch them by their request speed, just send out some code that runs them into an endless loop ;)

blend27: sounds like you've really put some work into that. Header analysis is usually a sure thing, and it even works the other way round: people who surf with Googlebot's useragent rarely think about adding the From header.

MrWumpus
msg:3706769 - 10:48 pm on Jul 24, 2008 (gmt 0)

I find it fairly easy to distinguish bogus from real users on our retail sites. I rely on raw logs only and use ClickTracks Optimizer. With ClickTracks I create visitor segments. For instance, we create a segment of people who visited our order confirmation page. That means people who actually placed an order with a real credit card. We can then study how that segment browses the entire site. We also filter out countries that we don't ship to, so that eliminates a ton of scrapers and bots. We then compare segments like those who abandoned the shopping cart, or left after certain error pages. We have a couple of years worth of data to compare and after all the filtering we're pretty confident the data is about as good as we can hope for from raw logs.

This thread has me thinking, however, of placing some sort of trap for bots and scrapers somewhere on our main page such as a 1-pixel link, or an invisible link, and then I can segment those visitors off since no human visitor would have seen or clicked those links. I'd want to be sure this doesn't affect legit search bots, though.

janharders
msg:3706777 - 10:57 pm on Jul 24, 2008 (gmt 0)

>>I'd want to be sure this doesn't affect legit search bots, though.

just block the trap in robots.txt -- no legit bot should ever follow it then.

timchuma
msg:3708911 - 1:52 am on Jul 28, 2008 (gmt 0)

I use a combination of Adsense/Analytics and a raw log analyser that makes web reports/pages.

I did manually analyse data from a log file for one particular website for six years via a database/excel spreadsheet.

The biggest traffic spikes are related to specific events so I can tell where the traffic is coming from and what they want to look at.

incrediBILL
msg:3709863 - 12:46 am on Jul 29, 2008 (gmt 0)

Here's another prime example of why I say your log files are fiction:
[webmasterworld.com...]

I shared the list of user agents, many of them browsers, that a WordPress hacking script used to attack sites. Many other scripts use similar lists, often longer ones, as someone also points out in that thread.

Now just imagine there are many of these scripts, or variations on them, being run by hundreds or thousands of little kiddies wanting to leave their graffiti on your site, and you start to get a clue what some of that traffic really is.

Then add the organized-crime element from a country I won't mention, which likes to infiltrate residential computers by the hundreds of thousands and use them as automated minions to do its bidding on the web, and the sheer scope of this problem becomes truly amazing.

Maybe your particular site only sees a little of this activity but mine sees a lot, WebmasterWorld sees a lot (just ask Brett), so be grateful if you're not in their crosshairs... yet.
