|Large log table, what would a DBA do?|
archive last year's data?
I have a large MySQL MyISAM table that logs page views. The current size of the table is 7GB. It has been running for 18 months.
The page views log is the basis of licensing payments to a third party. We insert to the table all day long (for every page view), and then we query every quarter. After the end-of-year query in January, we don't need to look at last year's data unless we get audited.
Before we started this process, our db was a manageable size. Now it is really hard to work with mysqldump because the db is over twice the size it used to be. And of course, it just keeps getting bigger.
I am wondering if I should start archiving last year's data out of this table. I could do something like this:
CREATE table page_views_2010 as SELECT * FROM page_views WHERE year(mdate) = 2010
DELETE from page_views WHERE year(mdate) = 2010
I wonder if that is what a well-informed DBA would do? Is it good practice? Or should I just live with the table getting larger and larger and larger? Any other advice?
And how many records is 7GB worth?
i'm not sure i qualify as a well informed dba, however i do think it good practice to archive old data that isn't being updated - the usual way would be into an archive table as you have suggested per day/week/month/year, whatever seems reasonable.
of course it would be easy enough to run queries on the entire dataset using UNION if required.
would it be possible to create aggregate/sum data that is acceptable every quarter or so, which could be written to another table in whatever appropriate manner and the original raw data deleted or permanently archived off site (eg deleted from the db to help your dump problem)
I have made many mistakes with page view logging database structures. ;) I few things I've learned:
Everything outside of the last X days (where X depends on the situation, but typically 30 works) should be automatically moved to a separate database entirely to keep writes to the "live" table fast. In my case, I also have my scripts automatically break the data into year tables, `pageviews_2006`, `pageviews_2007`, etc. Sometimes it makes sense to break them into smaller chunks (year_month) or chunks by other values (by site, section, or otherwise).
Reduce column size
You probably store a ton of bytes in each row that can be stored in a more efficient way.
- IP addresses Instead of using VARCHAR(16) use INT UNSIGNED and translate that data using INET_ATON() and INET_NTOA() or similar functions in your programming layer.
- User agent strings Instead of using VARCHAR(255) or, worse, LONGTEXT, create a second table to store user agent strings and generate an auto-inc serial number for each new occurence. In your main table, for useragent just have an INT UNSIGNED that references the primary key from the useragent table.
- URL accessed Same as the user agent string. Keep the textual representation in a second database.
Use the right column types
- Use DATETIME for any timestamps.
- If you're logging any metadata like "human" or "bot", "session" or "no session", "loggedin" or "loggedout" be sure to use ENUM or SET instead of VARCHAR fields.
If you are going to allow users to generate their own reports on the data (for example, by customizing start/end dates for reports or selecting certain subsets of data), then having a read-only slave setup for those reports is vital. In fact, if you always read from the slave then you can disable all indexes on your main table, making writes nearly instant. Otherwise, if a user locks up your "live table" with a report, you can erode performance for current users.
Just a suggestion but if you are only writing to this table and not reading from it often or updating rows once inserted then I would suggest writing to flat files instead of into a DB.
You can have your method that does the writing create a new file for every quarter as this is what you report on. You can have a cron task compress the older files and move them to an archive directory.
It may be a little more work to build a method that will create your quarterly reports but it will be a lot less work for maintaining the data, and it should give a performance boost to other methods that are reading/writing from the DB.
Thanks very much for the replies. Glad I asked!
whoisgregg, thank you for sharing your experience. Very helpful.
I am working on creating the page_views_2010 archive database and have a follow up question:
Is it wise to create the archive table using CREATE table AS SELECT... syntax?
I notice the table doesn't end up having a column marked as the primary key or the autoincrement feature.
Should I create the table from the page_views table structure dump and then populate it separately?
Thanks for any input.
I use this query:
$sql = "CREATE TABLE IF NOT EXISTS `backup_db`.`log_".$archive_year."` LIKE `main_db`.`log`;";
I also disable keys before dumps and enable keys after dumps to speed it up.