Msg#: 4618832 posted 3:32 pm on Oct 24, 2013 (gmt 0)
In the process of rewriting my site (including reorganizing and updating the db structure) and I'm looking to optimize in a few areas.
I am looking for some thoughts on the best way to store data then retrieve and display it efficiently. To get things started lets discuss a often debated topic!
Here is an example to illustrate my first basic question:
Picture it: A Questions & Answers Section
DB: You have a table for Questions, one for Answers.
SITE: In the list of questions it shows the question title (the question, who posted it, when it was posted etc.) it also shows the current number of answers. You then click to go view the answers.
Things to consider: tens of thousands of requests per day
Now for my question: Is it best to calculate the number of answers whenever the list of questions is generated (through a join) or is it best to keep a field (column) in the question record for the number of answers (and update this number when an answer is added or deleted)? The first is more intensive but always accurate, the second is more efficient but relies on manual updating which can lead to inaccurate data...
This question applies to so many different parts of a web application. I've done both over the years, but honestly have mixed feelings about them both. One offends my sense of speed, the other my need for a clean, accurate database.
Which is best and why? Perhaps there is even a third (and forth) method that I haven't considered?
Msg#: 4618832 posted 4:10 pm on Oct 24, 2013 (gmt 0)
It also depend on your environment.
If you have a cache in front of your webserver or all your queries are in the query cache you could consider the clean database. Otherwise i would go for speed with a daily/weekly job to update the stats.
In my design i use several stat tables to keep the base database design clean.
Msg#: 4618832 posted 4:32 pm on Oct 24, 2013 (gmt 0)
@topr8: I don't mean literally manually updating the table myself, I mean that when a new answer is added/deleted the Question table would need to be updated to reflect the change. (This relies on all areas that allow for modification to answers to be "manually" coded to update the Questions table.)
Msg#: 4618832 posted 8:23 pm on Oct 25, 2013 (gmt 0)
Before the web, my database design experience almost always required third normal form [en.wikipedia.org].
On the web though, that division of data across multiple tables can really really kill performance. (If you need to perform multiple joins so that people can see the basic info in a list or grid view, you're doing it wrong.)
I think we've all heard time and time again how well users respond to fast sites. Speed for the user has to trump the obsessions of database admins for clean design. ;)
What I've found best is to mix the two approaches. I keep a really clean database structure underlying everything I do, but I also have what I call "cache" columns or sometimes entire "flattened" tables with some joined data and calculated data. Those are used to speed up querying the data.
In your example, I'd have a column in the `questions` table called `answer_count_cache` that is updated any time an answer is added or deleted. If database changes are possible outside of that code, I'd also run a cron at a reasonable interval to double check that the value is correct.
This really, really comes into play when you need to do something intensive like, for example, ranking the questions by how many pageviews they've received. Your pageview table will be enormous and there's just no way you'll be able to afford to run a 3-4 second query to spit out how many views each question has received in the last week/month/lifetime -- just to rank the questions in the first place.