Forum Moderators: DixonJones
The solution I came up with was to import the log files to a desktop relational database which worked very well at the time.
Three months later, the database has grown too big to search efficiently (a query for a single page across the full range of website logs takes about 30 minutes, not counting organising the output and making graphs), so I am thinking of scaling up to a server-based product.
Does anyone have any similar experiences they would like to share?
Thanks.
The thing that would help the most would be to index the page name field. Since your database is for reporting, you would probably be better off indexing around 5 fields, maybe even 10. Indexing shifts the processing load from reporting time to insert time. Indexing is best done after the data is loaded, but that might not be an option for you if you load in more data each day.
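To make the indexing point concrete, here is a minimal sketch using an in-memory SQLite database (the `hits` table and its columns are hypothetical stand-ins for whatever the log table actually looks like):

```python
import sqlite3

# In-memory database standing in for the log database; the table and
# column names here are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE hits (page_name TEXT, ip TEXT, hit_time TEXT)")
conn.executemany(
    "INSERT INTO hits VALUES (?, ?, ?)",
    [("/index.html", "10.0.0.1", "2002-01-01"),
     ("/about.html", "10.0.0.2", "2002-01-01")])

# Build the index after the bulk load so the inserts stay fast.
conn.execute("CREATE INDEX idx_page_name ON hits (page_name)")

# The query planner can now answer page-name lookups via the index
# instead of scanning every row.
plan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT COUNT(*) FROM hits WHERE page_name = ?",
    ("/index.html",)).fetchall()
count = conn.execute(
    "SELECT COUNT(*) FROM hits WHERE page_name = ?",
    ("/index.html",)).fetchone()[0]
print(plan, count)
```

The `EXPLAIN QUERY PLAN` output should mention the index, which is the sign the page-name filter is no longer a full-table scan.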
The next thing I would consider is called "Normalizing". But it's a tricky topic that I will go in to only if you want me to. (Sometimes it's best to denormalize a bit when it comes to reporting.) Basically, normalizing breaks your data into several tables.
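A minimal sketch of what normalizing looks like in practice (the schema is hypothetical): instead of repeating the full page name on every hit row, store each page once and reference it by an integer id.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical normalized schema: page names live in one table,
# and hits reference them by integer id instead of repeating the text.
conn.executescript("""
CREATE TABLE pages (page_id INTEGER PRIMARY KEY, page_name TEXT UNIQUE);
CREATE TABLE hits  (page_id INTEGER REFERENCES pages(page_id), ip TEXT);
""")

def record_hit(page_name, ip):
    # Create the page row only if it doesn't already exist.
    conn.execute("INSERT OR IGNORE INTO pages (page_name) VALUES (?)",
                 (page_name,))
    page_id = conn.execute("SELECT page_id FROM pages WHERE page_name = ?",
                           (page_name,)).fetchone()[0]
    conn.execute("INSERT INTO hits VALUES (?, ?)", (page_id, ip))

record_hit("/index.html", "10.0.0.1")
record_hit("/index.html", "10.0.0.2")

# Reporting joins the tables back together.
rows = conn.execute("""
    SELECT p.page_name, COUNT(*) FROM hits h
    JOIN pages p ON p.page_id = h.page_id
    GROUP BY p.page_name
""").fetchall()
print(rows)
```

The trade-off mentioned above is visible here: inserts do an extra lookup, and every report needs a join, which is why reporting databases are sometimes deliberately denormalized.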
My next idea isn't much easier. Maybe you don't need all the records you're keeping. There may be many records that have the same page name and the same IP. You could add a count field and if the page name and IP already exist, increment what's in the count field. There are some tricks to doing this.
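One way to do the count-field trick is an "upsert": insert the (page name, IP) pair, and if it already exists, increment its counter instead. A sketch with a hypothetical `hit_counts` table (the `ON CONFLICT ... DO UPDATE` syntax assumes SQLite 3.24 or later):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical aggregated table: one row per (page_name, ip) pair
# with a running count, instead of one row per hit.
conn.execute("""CREATE TABLE hit_counts (
    page_name TEXT, ip TEXT, hits INTEGER DEFAULT 1,
    PRIMARY KEY (page_name, ip))""")

def record(page_name, ip):
    # Insert a new row, or bump the count if the pair already exists.
    conn.execute("""
        INSERT INTO hit_counts (page_name, ip) VALUES (?, ?)
        ON CONFLICT (page_name, ip) DO UPDATE SET hits = hits + 1
    """, (page_name, ip))

for _ in range(3):
    record("/index.html", "10.0.0.1")
record("/index.html", "10.0.0.2")

rows = conn.execute(
    "SELECT ip, hits FROM hit_counts ORDER BY ip").fetchall()
print(rows)
```

This collapses repeated hits into one row per visitor/page pair, which is the space saving the paragraph above describes, at the cost of losing the individual timestamps.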
Finally, I think you should know how flawed what you're doing is. It's not very accurate. Simply counting IP addresses causes both overcounting and undercounting. AOL, for example, uses dozens of proxies, both for individual users and for thousands of users. So what adds up to 100 visitors might really be anywhere from 10 to 1000. It's really that bad. It might help a little if you also record the visitor's browser (the user agent).
1) don't keep data in there forever. Your database will get huge fast. It seems more appropriate to do reports each [month¦week¦quarter] and then remove the data. I personally run reports each week and only keep data for a month. If you want to keep the data around, make flat file logs also, or dump the data out of the database periodically.
2) You may find indexes hurt more than help. As was mentioned, it transfers work from select time to insert time. If you're putting stuff in the database in real time, you want it to be quick. Let your reports be slow. You can afford to let them run overnight and look at them tomorrow.
3) Think really hard about your queries. As was mentioned, counting IP addresses is not very accurate, but neither is counting unique IP addresses. What is more accepted is unique IP addresses per day. In other words, if a single IP address hits your page 4 times a day, every day, that counts as 7 visits in a week - not 28 and not 1. These problems crop up a bit more when you're dealing with the data at this level.
4) see Randal Schwartz's article on pulling reports from your database for a starter [stonehenge.com...] but don't use his method for putting the data in there. Ask wrote a nifty Perl module to do it. See Apache::DBILogger if you want a mod_perl module to do it for you.
5) if you just want to find out how many people have hit a single page, you may find it easier to use a log-analysis tool like wusage or analog or whatever is in vogue now. What I would do is use grep, wc, and other command-line tools to pull it out of my logs. You might be killing a fly with a hammer.
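The unique-IPs-per-day counting from point 3 can be sketched as a `DISTINCT` over (IP, day) pairs. Assuming a hypothetical log table with one row per request:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical log table: one row per request.
conn.execute("CREATE TABLE hits (ip TEXT, page TEXT, day TEXT)")

# One IP hitting the same page 4 times a day on each of 2 days.
rows = [("10.0.0.1", "/index.html", day)
        for day in ("2002-01-01", "2002-01-02")
        for _ in range(4)]
conn.executemany("INSERT INTO hits VALUES (?, ?, ?)", rows)

# Count each IP at most once per day.
daily_uniques = conn.execute("""
    SELECT COUNT(*) FROM (SELECT DISTINCT ip, day FROM hits
                          WHERE page = '/index.html')
""").fetchone()[0]
print(daily_uniques)  # 2: once per day, not 8 raw hits and not 1 IP
```

Extending the example to a full week of 4 hits a day gives 7, which is the figure point 3 arrives at.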
Hope it helps.
We have a pretty large online newsgroup service where we had been doing searches and logging. The only solution we found that was scalable was DB/2 - but running under OS/2 or eComStation. The AIX version should be at least as scalable and able to handle as much traffic or more. This is the same database server engine that flawlessly handled the Nagano Olympics, with OS/2 doing the real-time score updating and everything being cataloged, searched and indexed on DB/2 on AIX and OS/2.
Don't believe ANYONE who says databases can't handle the traffic. Nagano peaked at 124,000 requests PER MINUTE.
In a more real-life scenario (I doubt you are running a site so widely trafficked), our 300MHz PII has served 200,000 pages a day and easily peaked at 200 requests a second (image, html and database combined), staying at about 6-10% CPU utilization or less during peak usage.
DB/2 automatically uses different algorithms for high-traffic timeframes than for normal usage, including probably one of the best (THE best?) heuristics-based caching engines in the industry.
There are other MAJOR differences. Threading is the LARGEST one for high-traffic sites. Recent Apache releases for Linux (2.0+) have an improved threading engine, thanks to thread code IBM donated from OS/2 and a lot of Linux developer/contributor work, but Linux still does not have nearly as robust a threading engine. There are porting articles online that discuss this (I can find the technical references for anyone interested - sticky me). Basically, thread scheduling, management, timeslicing and such will greatly affect the performance of your database.

So, with threading incorporated in Apache for Linux, there is a performance gain from the lower overhead of a thread compared to a process (Apache 1.x's method of handling requests), UNTIL the traffic increases beyond what the OS can schedule efficiently. OS/2 can efficiently run 4,096 threads per process per CPU (and up to 64 CPUs per node - nothing else PC-based can do this, and the IBM x430 server is the only machine I know of that can do 64 CPUs per node, true 64-way - can you imagine a cluster of them running OS/2 in clustered mode? I can't!).
AIX's threading engine is at least comparable... as are those of many other large-scale Unixes. Linux needs a kernel rewrite to become as efficient. 200 threads is the reported "efficient cap" on today's average server hardware under Linux - super-high-end hardware would presumably have higher limits.
Windows, in any incarnation, bombs before then. 200 PROCESSES (each requiring 1 or more threads) can drag most Win implementations to their knees.
Now, as for database engines: MySQL is a great database engine, but (at least using its own table types) it does not seem to handle real-time, high-transaction, high-insert, high-query tables well. Using InnoDB, we have had a lot more success, but it requires more disk space and a little more overhead. The other big difference is that MySQL has different threading, process and memory limits depending on the OS you run it on.
Access and MS SQL based databases are a joke for really high-traffic databases... don't even bother. (My definition of high traffic is probably different from most people's, though. You would all be amazed at what a "mere PC" can really do. Our main network server (FTP, SMTP, POP3, backup web, DNS, file, print, time, and NetBIOS domain) is a "mere" Pentium 120 with 64MB of RAM... our big newsgroup server (talked about up top) is only a PII 300. OS/2 and eComStation raised our expectations to the level the original poster needs for truly high-end searchable, indexed, database-based stats on a PC.)
To summarize: with the tons of traffic our site got under DB/2 v4 on OS/2 Warp Server for e-Business (1996 release), we could expect queries to take fractions of a second. On the 2001 release of OS/2 (or the current release of eComStation) and DB/2 v7 we are told to expect even better, but we haven't implemented it yet.
- Rob
If you are serious about doing this and you don't have lots of hardware lying around, I'd suggest eComStation (you can guess the URL to find it) and DB/2 v4 to v7. (BTW, all our DB/2 high-end results were done with v4... v7 is supposed to be more capable.)
I have decided to keep using the same database, but I will only store a month's data at a time, as it was starting to get rather large.
Another problem I found was that my query used a LIKE clause on a Memo field, which made it very slow.
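That slowdown is typical: a LIKE with a leading wildcard cannot use an index, so the engine scans every row (and a Memo field is long text, making each comparison expensive). A small SQLite sketch of the difference, with hypothetical table and column names:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
# Hypothetical table: a short indexed column plus a long memo column.
conn.execute("CREATE TABLE hits (page_name TEXT, details TEXT)")
conn.execute("CREATE INDEX idx_page ON hits (page_name)")
conn.execute(
    "INSERT INTO hits VALUES ('/index.html', 'long memo text ...')")

# A leading-wildcard LIKE forces a full table scan ...
scan = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM hits WHERE details LIKE '%index%'"
).fetchall()
# ... while an exact match on an indexed column can use the index.
seek = conn.execute(
    "EXPLAIN QUERY PLAN SELECT * FROM hits WHERE page_name = '/index.html'"
).fetchall()
print(scan, seek)
```

The usual fix is what the query plans suggest: pull the value you search on out into its own short, indexed column and match it exactly, rather than pattern-matching inside a memo field.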
Thanks.