The answer to your question really depends on what you need to know about your site activity.
If you have just a few focused questions you need to answer, then your best bet is likely to be writing a simple Perl script to get the information.
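As a rough illustration (a sketch only, not a recommendation of any particular tool), something along these lines can answer one focused question such as "what are my most-requested URLs?". It assumes a standard Apache combined-format log called access.log; the filename and format are assumptions you would adjust for your own setup.

```perl
#!/usr/bin/perl
# Minimal sketch: count the most-requested URLs in an Apache
# combined-format access log (filename is an assumption).
use strict;
use warnings;

my %hits;
open my $log, '<', 'access.log' or die "Cannot open access.log: $!";
while (my $line = <$log>) {
    # Combined format: host ident user [date] "METHOD /path HTTP/1.x" status bytes "referer" "agent"
    next unless $line =~ m{"(?:GET|POST|HEAD) (\S+) HTTP/[\d.]+"};
    $hits{$1}++;
}
close $log;

# Print the 20 most-requested URLs
my @top = (sort { $hits{$b} <=> $hits{$a} } keys %hits)[0 .. 19];
printf "%8d  %s\n", $hits{$_}, $_ for grep { defined } @top;
```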
If you have a broader range of requirements then you have 2 options:
1. Deploy one of the ASP tracking solutions that require you to put a small piece of JavaScript on each page. That way they have to deal with the volume and processing issues and you just get the reports.
2. Buy a package that can manage your data volumes. If you have relatively simple requirements then WebTrends will probably be OK for you. You don't get much flexibility with WebTrends, but it can crunch through logs relatively quickly and you can buy the low-end log parser for not much money.
If your requirements are more complex then you might be looking at needing to spend a reasonable amount of cash to get the answers you need.
My main advice to you would be to start by working out exactly what the business needs to know and selecting a tool/service on that basis.
You may find that your best bet is a combination of approaches - WebTrends/another log parser for basic reports and a custom script for some of your more complex, business specific requirements.
One possible way is to write a macro in Excel, but Excel only holds 65,536 rows AFAIK, so 2.5 GB of logs would equate to a hell of a lot more rows than that. I tried this method with only 250 MB of data and it still blew past the row limit :(
I wrote a routine in my fav language (qbasic lol!) to do it for me, that works a lot better IMO.
Thanks for the feedback.
I am hoping to find out a few things:
1) Referrers and keywords
2) Clickstream
3) Bot/Spider activity
The problem is that our pages are dynamically generated in jsp, and so multiple "pages" may actually be the same page rendered with different URLs.
I have tried ClickTracks and am looking at NetTracker, but I am really worried about the speed of the systems.
I have always shied away from WebTrends, since it has so many inherent issues, both in manageability and ease of use.
micah
By moving from the standard raw log format into my own DB I cut the data down to around 10% of the original file size.
Start by excluding pictures and css and js files from your logs for example.
There is lots you can do before you worry about analysis to make things more manageable.
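A pre-filter along these lines is one way to do the exclusion mentioned above. It is only a sketch: the list of extensions and the usage shown are assumptions to adjust for your own site.

```perl
#!/usr/bin/perl
# Sketch: strip image, CSS and JS requests out of a raw combined-format log
# before analysis. Usage (assumed): perl slim_log.pl < access.log > access.slim.log
use strict;
use warnings;

while (my $line = <STDIN>) {
    # Skip requests for static assets we don't want to analyse
    next if $line =~ m{"(?:GET|POST|HEAD) \S+\.(?:gif|jpe?g|png|ico|css|js)(?:\?\S*)? HTTP}i;
    print $line;
}
```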
Also, MERGE tables in MySQL might be a good way to query files larger than 2 GB at once. You need a fast HD and a good indexing scheme, though, to make things real-time.
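In case it helps, here is a rough sketch of that idea: monthly MyISAM tables combined under one MERGE table so they can be queried as a single logical table. The database name, credentials, column layout and table names are all assumptions, and the exact DDL syntax depends on your MySQL version.

```perl
#!/usr/bin/perl
# Sketch: monthly MyISAM log tables queried as one via a MERGE table.
use strict;
use warnings;
use DBI;

# Connection details are placeholders
my $dbh = DBI->connect('DBI:mysql:database=weblogs', 'user', 'password',
                       { RaiseError => 1 });

my @months = qw(2004_01 2004_02 2004_03);
for my $month (@months) {
    $dbh->do(qq{
        CREATE TABLE IF NOT EXISTS hits_$month (
            ts       DATETIME     NOT NULL,
            url      VARCHAR(255) NOT NULL,
            referrer VARCHAR(255),
            agent    VARCHAR(255),
            INDEX (ts), INDEX (url)
        ) ENGINE=MyISAM
    });
}

# The MERGE table presents all the monthly tables as one queryable table
my $union = join ', ', map { "hits_$_" } @months;
$dbh->do(qq{
    CREATE TABLE IF NOT EXISTS hits_all (
        ts       DATETIME     NOT NULL,
        url      VARCHAR(255) NOT NULL,
        referrer VARCHAR(255),
        agent    VARCHAR(255),
        INDEX (ts), INDEX (url)
    ) ENGINE=MERGE UNION=($union) INSERT_METHOD=LAST
});

$dbh->disconnect;
```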
SN
Let me address your points one at a time (but not in order):
1. The problem is that our pages are dynamically generated in jsp, and so multiple "pages" may actually be the same page rendered with different URLs.
This is your main issue - if you don't have the right raw data you won't be able to do the analysis. So, to address this issue accurately, it would be helpful to see an example of what one of your URLs might look like (you can de-personalise it so your site isn't referenced).
This will help us understand whether you have the page definition in the page query parameters (after the file name) or whether the page definition is not captured when the page is served. For example, you might have a set of parameters on each URL that says cid=12345 which identifies the page.
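If you do have something like that cid parameter, even a small script can roll the different URL variants up into logical pages. This is only a sketch; the parameter name and log format are assumptions based on the example above.

```perl
#!/usr/bin/perl
# Sketch: roll dynamically generated URLs up to their logical page,
# assuming a "cid" query parameter identifies the page (an assumption).
use strict;
use warnings;

my %page_views;
while (my $line = <STDIN>) {
    next unless $line =~ m{"(?:GET|POST) (\S+) HTTP};
    my $url = $1;
    # Every request carrying the same cid counts as a view of the same page
    my ($cid) = $url =~ /[?&]cid=(\d+)/;
    my $page  = defined $cid ? "cid=$cid" : $url;
    $page_views{$page}++;
}

printf "%8d  %s\n", $page_views{$_}, $_
    for sort { $page_views{$b} <=> $page_views{$a} } keys %page_views;
```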
Even if you don't have the page identified in the URL, all is not lost :) You do have the option of deploying a page tag/bug type data capture solution which might be able to capture real-time properties of the page when it is viewed for analysis purposes. Understanding the raw data is key before attempting any type of analysis.
Also, be prepared for the possibility that your site may not be analysis friendly - I've seen site design issues before that have meant technical changes were required before analysis was possible. Not trying to scare you - just letting you know what is possible :)
2. Referrers and keywords. This is a common requirement which almost every software vendor should be able to meet. However, the main question is what you want this information for - what are the actions you intend to take based upon this knowledge. For example, if you are looking for some kind of measure of customer value over time based on original source, then you need to ensure you have a clear picture of a unique user over time - not simple. If you are just interested in counting clickthroughs then simple analysis should do the job, BUT you really need to understand behaviour after arrival on site to understand how much that clickthrough was worth to you. If a visitor from a particular keyword hits the homepage and leaves, how valuable was that keyword? If they buy something or register then that keyword might be more valuable to you. So, it's all about the context.
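To give a feel for the basic mechanics, keyword extraction from the referrer field looks roughly like the sketch below. It only handles engines that pass the query in a q= or p= style parameter (an assumption; real tools handle far more cases) and it needs the URI::Escape module from CPAN.

```perl
#!/usr/bin/perl
# Sketch: pull search keywords out of the referrer field of a combined log.
# Only handles engines that pass the query as q= or p= (an assumption).
use strict;
use warnings;
use URI::Escape qw(uri_unescape);

my %keywords;
while (my $line = <STDIN>) {
    # The referrer is the second-to-last quoted field in combined format
    next unless $line =~ /"([^"]*)" "[^"]*"$/;
    my $ref = $1;
    next unless $ref =~ /[?&][qp]=([^&]+)/;
    my $terms = lc uri_unescape($1);
    $terms =~ tr/+/ /;       # '+' encodes a space in query strings
    $keywords{$terms}++;
}

printf "%6d  %s\n", $keywords{$_}, $_
    for sort { $keywords{$b} <=> $keywords{$a} } keys %keywords;
```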
3. Clickstream. Be very wary of clickstream reporting. Everyone who looks at web analytics says "I must have clickstream". The reality is that with most software solutions you get a list of "top paths". This is a huge, messy set of complex paths which generally doesn't tell you very much about your customers. It can be extremely hard to interpret and is generally very costly to compute. There are two technical approaches to clickstream reporting that I have seen that I think are useful and interpretable. Those are from ClickTracks (which you say you have tried) and, at the other end of the cost scale, from SPSS, whose solution uses data mining to discover significant paths. The thing that both have in common is that they can look at paths by user segment. Also, the SPSS solution can produce clickstreams by 'events' or important site activities.
Again, be clear about the business result you want from clickstream reporting. It's nice to know the paths that people take through your site, but do you NEED to know it to improve your online business? Again, think about the actions you can take when you know the answer to the question.
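To illustrate why raw "top paths" reports get messy so quickly, here is a crude sketch of what they do under the hood. It treats each client IP as one visitor, which real tools would not do (they use cookies and session timeouts), and the log format assumptions are as before.

```perl
#!/usr/bin/perl
# Crude sketch of a "top paths" report: group requests by client IP
# (a poor stand-in for a real visitor/session) and count identical paths.
use strict;
use warnings;

my %paths;
while (my $line = <STDIN>) {
    next unless $line =~ m{^(\S+) .*"(?:GET|POST) (\S+) HTTP};
    my ($ip, $url) = ($1, $2);
    push @{ $paths{$ip} }, $url;
}

# Count identical full paths and show the most common ones
my %path_count;
$path_count{ join ' -> ', @$_ }++ for values %paths;

my @top = (sort { $path_count{$b} <=> $path_count{$a} } keys %path_count)[0 .. 9];
printf "%6d  %s\n", $path_count{$_}, $_ for grep { defined } @top;
```

Even on a modest log this tends to produce a long tail of distinct paths, which is exactly the interpretability problem described above.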
4. Bot/spider activity. Again, these are reports that almost every web analytics tool can provide. Most should be able to tell you which bots/spiders came to your site and when.
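If you just want a quick sanity check before buying anything, a sketch like this tallies hits from a few well-known crawlers by user-agent string. The list of agent substrings is illustrative only, not exhaustive or authoritative.

```perl
#!/usr/bin/perl
# Sketch: tally hits from a handful of well-known crawlers by user-agent.
# The substring list is illustrative only.
use strict;
use warnings;

my @bots = qw(Googlebot Slurp msnbot Teoma Scooter ia_archiver);
my %bot_hits;

while (my $line = <STDIN>) {
    # The user-agent is the last quoted field in combined format
    next unless $line =~ /"([^"]*)"$/;
    my $agent = $1;
    for my $bot (@bots) {
        if (index($agent, $bot) >= 0) {
            $bot_hits{$bot}++;
            last;
        }
    }
}

printf "%8d  %s\n", $bot_hits{$_}, $_
    for sort { $bot_hits{$b} <=> $bot_hits{$a} } keys %bot_hits;
```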
So, I would urge you to define your analysis requirements based on the actions you will take when you have the information. If you can't take action then it probably isn't worth trying to produce the report. Be hard on yourself when creating reports and make sure they have an action. If you are being asked for reports by the business, ask them what action they will take based on the data to justify their requirements.
Sorry that much of this doesn't give you specific answers to your questions. The reason for this is that without understanding exactly what your business needs and why, it's hard to recommend the right solution or solutions.
Don't be afraid of using more than one tool to get the answers you need. Also, ASP providers may be useful to you if you are concerned about processing large data volumes and in-house resources to do the work. For most companies I know, they use a combination of tools and techniques to find the answers they need.
Sorry this is such a long post! Phew!