I have made many mistakes with page view logging database structures. ;) A few things I've learned:

**Automatic archiving**
Everything outside of the last X days (where X depends on the situation, but typically 30 works) should be automatically moved to a separate database
entirely, to keep writes to the "live" table fast. In my case, I also have my scripts automatically break the data into year tables: `pageviews_2006`, `pageviews_2007`, etc. Sometimes it makes sense to break them into smaller chunks (year_month), or into chunks by other values (by site, section, or otherwise).
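To illustrate the archiving idea, here's a minimal sketch in Python using SQLite in place of MySQL; the `pageviews` table and its `url`/`viewed_at` columns are assumptions for the example, and a real job would run this on a schedule:

```python
import sqlite3
from datetime import datetime, timedelta

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE pageviews (url TEXT, viewed_at TEXT)")

# Seed one fresh row and one row well outside the 30-day window.
now = datetime(2007, 6, 15)
old = now - timedelta(days=90)
conn.execute("INSERT INTO pageviews VALUES ('/home', ?)", (now.isoformat(),))
conn.execute("INSERT INTO pageviews VALUES ('/about', ?)", (old.isoformat(),))

def archive_old_rows(conn, cutoff):
    """Move rows older than `cutoff` out of the live table into
    per-year archive tables (pageviews_2006, pageviews_2007, ...)."""
    rows = conn.execute(
        "SELECT url, viewed_at FROM pageviews WHERE viewed_at < ?",
        (cutoff.isoformat(),),
    ).fetchall()
    for url, viewed_at in rows:
        table = "pageviews_" + viewed_at[:4]  # year prefix of ISO timestamp
        conn.execute(
            "CREATE TABLE IF NOT EXISTS %s (url TEXT, viewed_at TEXT)" % table
        )
        conn.execute("INSERT INTO %s VALUES (?, ?)" % table, (url, viewed_at))
    conn.execute("DELETE FROM pageviews WHERE viewed_at < ?",
                 (cutoff.isoformat(),))
    conn.commit()

# Keep only the last 30 days in the live table.
archive_old_rows(conn, now - timedelta(days=30))
```

The live table stays small, and each year's archive table can be indexed and queried independently.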
**Reduce column size**

You probably store a ton of bytes in each row that could be stored in a more efficient way.
**Use the right column types**
- **IP addresses:** Instead of a VARCHAR(15), use an INT UNSIGNED and translate the data with INET_ATON() and INET_NTOA(), or similar functions in your programming layer.
- **User agent strings:** Instead of VARCHAR(255) or, worse, LONGTEXT, create a second table that stores each distinct user agent string and assigns it an auto-increment serial number on its first occurrence. In your main table, store just an INT UNSIGNED that references the primary key of the useragent table.
- **URL accessed:** Same as the user agent string: keep the textual representation in a second table and reference it by key.
- Use DATETIME for any timestamps.
- If you're logging any metadata like "human" or "bot", "session" or "no session", "logged in" or "logged out", be sure to use ENUM or SET instead of VARCHAR fields.
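A rough sketch of the first two ideas at the application layer, with Python's `ipaddress` module standing in for MySQL's INET_ATON()/INET_NTOA(), and a dict standing in for the useragent lookup table:

```python
import ipaddress

# IP addresses: store a 4-byte unsigned int instead of a dotted-quad string.
ip_as_int = int(ipaddress.IPv4Address("192.0.2.1"))   # value for the INT UNSIGNED column
ip_back = str(ipaddress.IPv4Address(ip_as_int))       # back to text for display

# User agents (and URLs): one row per distinct string in a lookup table;
# the log row stores only the serial number. The dict here stands in for
# a table with an auto-increment primary key.
useragents = {}

def useragent_id(ua):
    """Return the serial number for a user agent, creating one if new."""
    if ua not in useragents:
        useragents[ua] = len(useragents) + 1
    return useragents[ua]
```

The same lookup-table pattern applies to URLs: a few bytes per row in the main table instead of hundreds.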
If you are going to let users generate their own reports on the data (for example, by customizing start/end dates or selecting certain subsets of the data), then having a read-only slave for those reports is vital. In fact, if you always read from the slave, you can disable all indexes on your main table, making writes nearly instant. Otherwise, a user who ties up your "live" table with a report can erode performance for current users.
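The routing itself can be as simple as picking a host per query type. A minimal sketch, where the host names ("master-db", "replica-db") and the `connect()` stub are assumptions standing in for your real database driver:

```python
def connect(host):
    # Stub standing in for e.g. MySQLdb.connect(host=host, ...)
    return {"host": host}

def get_connection(for_report=False):
    """Live logging writes hit the master; user reports hit the replica."""
    host = "replica-db" if for_report else "master-db"
    return connect(host)

live = get_connection()                 # page view inserts
report = get_connection(for_report=True)  # user-generated reports
```

With this split, a long-running report query can only slow the replica, never the write path.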