|Normalization - A Brief Introduction|
databases, unlike web geeks, need to be normal
In the movie Young Frankenstein, Igor is charged with getting a brain for the monster. He drops the brain he was after and brings back another one instead. After the wrong brain was "installed" and the mistake discovered, the Doctor asked him whose brain he had brought back, to which Igor replied, "Abby someone. Yes, Abby Normal."
"Normal" and "abnormal" are also terms used to describe the design of a relational database. A database is said to be normal when all traces of redundant data are eliminated.
Putting the R back in RDBMS
The R stands for Relational, as in relative and relates. How many times are we introduced as "Bill's son," "Sally's husband," or "Arnold's grandson"? Those references to YOU use a relational model.
Your name might be John, but at school your dad is well known - you are Bill's son. Someone made a relationship between you and your dad.
In a database you might have a table of names:
But you also want to store information about those people: their relatives, their addresses, hair color, etc. Some data will be unique to that name, but some might not be. Let's take a basic example - names and phone numbers.
name, phoneno, type
John Smith, 212-555-1212, home
John Smith, 212-555-1234, cell
John Smith, 212-555-9898, office
Fred Jones, 718-555-1212, home
Fred Jones, 718-222-2222, fax
Fred Jones, 718-111-1111, cell
Fred Jones, 718-333-3333, pager
In that simple example above, you can see that there is redundant data - the name of the person. It would be better to split this into two tables using some type of relation between the tables to identify the data. I'm going to use an ID number for each name so when I have to find the match later, I can do it by numbers rather than letters. Databases prefer numbers.
id, name
101, John Smith
102, Fred Jones
id, number, type
101, 212-555-1212, home
101, 212-555-1234, cell
101, 212-555-9898, office
102, 718-555-1212, home
102, 718-222-2222, fax
102, 718-111-1111, cell
102, 718-333-3333, pager
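The two-table split above can be sketched in SQLite (using Python's built-in sqlite3 module; the table and column names are illustrative, not from any particular product):

```python
import sqlite3

# Minimal sketch of the two-table design above, using an in-memory SQLite db.
con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE phones (id INTEGER, number TEXT, type TEXT)")
cur.executemany("INSERT INTO names VALUES (?, ?)",
                [(101, "John Smith"), (102, "Fred Jones")])
cur.executemany("INSERT INTO phones VALUES (?, ?, ?)",
                [(101, "212-555-1212", "home"),
                 (101, "212-555-1234", "cell"),
                 (102, "718-555-1212", "home")])
# The "relation": match each phone row back to its name via the shared id.
rows = cur.execute(
    "SELECT n.name, p.number, p.type "
    "FROM names n JOIN phones p ON n.id = p.id "
    "ORDER BY p.number").fetchall()
```

Each name is now stored exactly once, and the join reassembles the original flat view on demand.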
Better, but we still have a little redundant data - the phone type. So we could add a third table with the type codes:

code, type
h, home
c, cell
o, office
f, fax
p, pager

and store just the code with each phone:
id, number, type
101, 212-555-1212, h
101, 212-555-1234, c
101, 212-555-9898, o
102, 718-555-1212, h
102, 718-222-2222, f
102, 718-111-1111, c
102, 718-333-3333, p
Anything else that can be simplified or normalized? Perhaps. But I'd be happy with this level of normalization. I can add new types easily, and when I add a new person I'm not creating empty space in the first table for phones and types.
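The full three-table version can be sketched the same way (again in SQLite, with illustrative names); note how adding a new type is just a one-row insert:

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE names (id INTEGER PRIMARY KEY, name TEXT)")
cur.execute("CREATE TABLE types (code TEXT PRIMARY KEY, type TEXT)")
cur.execute("CREATE TABLE phones (id INTEGER, number TEXT, code TEXT)")
cur.executemany("INSERT INTO types VALUES (?, ?)",
                [("h", "home"), ("c", "cell"), ("o", "office"),
                 ("f", "fax"), ("p", "pager")])
cur.execute("INSERT INTO names VALUES (101, 'John Smith')")
cur.execute("INSERT INTO phones VALUES (101, '212-555-1234', 'c')")
# Adding a new type later is a single insert - no schema change, no
# empty columns in the other tables:
cur.execute("INSERT INTO types VALUES ('w', 'work mobile')")
# Resolve the code back to its label when reading:
row = cur.execute(
    "SELECT n.name, p.number, t.type "
    "FROM names n JOIN phones p ON n.id = p.id "
    "JOIN types t ON p.code = t.code").fetchone()
```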
Normalization goes way beyond this simple example. Database Administrators get into fist fights over the level of normalization required. But it is valuable when planning any database project. Planning to be normal from the start will save lots of grief in the future (I speak from personal experience....)
No, really, I mean that.
I'm a db designer by trade in my day job, and normalisation is second nature.
So you'd think jumping into dynamic sites with a db backend would be second nature for me too, right? Wrong. Because, and only because, the many examples and tutorials I've read have pretty much ignored normalisation in favor of a 'flat file' approach* - to the extent that I've been more or less convinced that a well-normalised database must present some problem in terms of PHP or the way SQL is implemented online.
So you're saying it shouldn't be a problem?
I can design and normalise my databases much as I would for an offline app?
* ok, obviously that could be a matter of the writer thinking 'let's not confuse the newbies too much', but in my case it did just the opposite.
|Planning to be normal from the start will save lots of grief in the future (I speak from personal experience....) |
I'll vouch for that absolutely!
|So you're saying it shouldn't be a problem? I can design and normalise my databases much as I would for an offline app? |
Absolutely! In fact it would help most web apps to be normalized because it can reduce the number of queries to the database (the slowest part of web apps).
The only downside is that the queries become more complex, involving inner and outer joins (more on that later).
Which is probably why most tutorials take a flat file approach. It's easier to build a simple query than to try to explain joins.
Most small-scale databases gain nothing from normalisation, and in fact the extra complexity increases coding time: when tables are small they will get table-scanned anyway and indices won't help, so instead of one denormalised table scan you will get many.
So, IMO, small databases don't need to be seriously normalised.
Big databases are an exception too - denormalisation is an acceptable technique designed to improve query performance. All those fancy joins don't come free, you know!
Great post txbakers, great topic!
Complexity? That depends how it's managed. In a good design, the first tier is the underlying normalised tables. I would always go to third normal form (no repeating groups, no redundancy), avoid allowing NULLs, and use inner joins only. I would need very good reasons not to do this for any real/live system.
The next tier is the logical views (or queries, as you'd call them in Access!) that look more like flat files or business entities. These make queries dead easy, as there are no joins when selecting from them.
I might go back to second normal form (which is different from denormalising!) and shove the data into a cube for analysis, but for transaction systems: no, no.
> small databases don't need to be seriously normalised.
Many large databases started out as small databases. Converting a badly normalized database into something proper can be quite a task, to say the least. Believe me, I've been there :(
Unfortunately your TYPES table is unnecessary.
Type is an attribute of a phone and does not need a separate table, as it has no attributes of its own other than the type value (home, office, cell). You are forced to invent another attribute (h, o, c, etc.) just to justify the existence of the TYPES table. This is known as an over-normalised set of tables - normalised more than is necessary to remove redundancy. You could just as easily have used h, o, c in PHONES without the TYPES table existing!
The example would have been better if you'd included address details for each name, to show how much redundant data was being repeated before the names were separated from the phones.
Yes, you're right, but I wanted to keep it simple for the first intro.
I had to create some tables last week that were purposely redundant. It hurt me to do it but I couldn't find any other way to do it based on user requirements.
Actually PhoneTypes is not over-normalised.

id, number, type
101, 212-555-1212, home
102, 718-555-1212, home

In the above, the data "home" is stored twice in the database, and that is redundancy.

The tables below solve this problem:

id, number, typeid
101, 212-555-1212, 1
102, 718-555-1212, 1

typeid, type
1, home

Now if you wish to rename "Home" to "Home Telephone Number" you only have to do it in one place - the phone types table - and not in every record in Phones with a home phone.
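That rename-in-one-place point can be sketched in SQLite (in-memory database; table and column names are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
cur = con.cursor()
cur.execute("CREATE TABLE phonetypes (typeid INTEGER PRIMARY KEY, type TEXT)")
cur.execute("CREATE TABLE phones (id INTEGER, number TEXT, typeid INTEGER)")
cur.execute("INSERT INTO phonetypes VALUES (1, 'Home')")
cur.executemany("INSERT INTO phones VALUES (?, ?, ?)",
                [(101, "212-555-1212", 1), (102, "718-555-1212", 1)])
# The rename touches one row in one table...
cur.execute("UPDATE phonetypes SET type = 'Home Telephone Number' "
            "WHERE typeid = 1")
# ...and every phone record picks it up through the join.
labels = [t for (t,) in cur.execute(
    "SELECT pt.type FROM phones p "
    "JOIN phonetypes pt ON p.typeid = pt.typeid")]
```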
I like the topic and have worked with normalized dbs for quite some time now. If you ever get the chance to work with CNet you'll be in for some fun because it's like 4NF and it takes some time to get a true understanding of the data model.
One thing that I would add to the conversation is that highly normalized data models can add to the complexity of your queries in the form of joins. Some of the larger systems that I've worked on actually required processes to flatten out the data into read-only tables that were built daily, in order to avoid writing sprocs that made 5 or more joins on large tables.
I have a relatively complex database model for one of my sites that started out fairly well normalized, but I ended up needing to build flattened summary tables due to performance issues. Many queries were joining 5-10 different tables, including several one-to-many relationships. I like to keep the time it takes to generate any one page on my site below 1/10th of a second, and some of the individual queries were taking several seconds. On top of that, the complexity was making it a bear to maintain.
great post txbakers. explains it in really simple terms.
i have found using normalised tables for even small apps can greatly ease administration tasks - i can only recommend it. plus for all but the most dynamic websites / forums, sql queries can be cached, meaning joins needn't be any sort of bottleneck.
The problem with normalization is that it is almost always a judgement call. Sort of like encapsulation in OO design, it's great in theory, but if you cross all the i's and dot all the t's...
To add to that - the addition of the phone types table naturally provides a domain constraint, ensuring we don't get phone types like "Mom's Boyfriend" mixed in with our phone numbers. Not too likely if you have only one page that can update the phone numbers, but if the database becomes shared across multiple websites, for example, things can start to fall into disarray without constraints.
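The domain-constraint idea can be sketched with a foreign key in SQLite (note that SQLite ships with foreign keys off by default; names here are illustrative):

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.execute("PRAGMA foreign_keys = ON")  # off by default in SQLite
con.execute("CREATE TABLE phonetypes (typeid INTEGER PRIMARY KEY, type TEXT)")
con.execute("CREATE TABLE phones (id INTEGER, number TEXT, "
            "typeid INTEGER REFERENCES phonetypes(typeid))")
con.execute("INSERT INTO phonetypes VALUES (1, 'home')")
con.execute("INSERT INTO phones VALUES (101, '212-555-1212', 1)")  # known type
try:
    # typeid 99 ("Mom's Boyfriend", say) isn't in phonetypes...
    con.execute("INSERT INTO phones VALUES (101, '212-555-0000', 99)")
    rejected = False
except sqlite3.IntegrityError:
    rejected = True  # ...so the database itself refuses the row.
```

The constraint lives in the database, so every website or page that writes to it is held to the same domain, not just the one page you wrote carefully.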
|Now if you wish to rename "Home" to "Home Telephone Number" you only have to do it in one place - the phone types table and not every record in Phones with a Home Phone. |
For large-scale (or growing) sites a DBA usually goes through a 2-step process: 1) normalization (breaking the data into pieces as small as makes sense), and then 2) de-normalization (e.g. it is impractical to have more than 3-4 table joins in a query). Every DB application (read "website") is different.
In a live DB it is always a trade-off between ease of updating and speed/ease of data extraction.
This is what txbakers has already done, except that letters were used instead of numbers for the key (h instead of 1)...
But you're right about your justification for PhoneTypes being a separate table, afaik.
physics, I believe aspdaddy used integers because they scale out better than letters, and they take up less space, which is important when you start getting into tall tables.
Phone types don't demonstrate that point because there are only a few. But imagine having 500 types that needed unique keys: a plain integer sequence (1, 2, 3, ...) scales better than trying to invent 500 unique letter codes.
There is still redundant data in the PHONES table.
Why multiple entries for a user's id? In my opinion, using an id or a name isn't much of a difference; there are still duplicates.
You could always use a longtext-type field and enter the telephone numbers as a comma-separated string, doing the same for the associated phone types.
Key to this approach is of course being able to insert and update the field as required and in good order; we don't want cell phone numbers mixed up with work numbers, for example, now do we?
Simple PHP implode() and explode() functions can perform the basic operations here.
Downside: more PHP code to be written and far more complicated queries.
It all comes down to preference, I guess, when you design both the DB AND develop the code AND design the site, like I do.
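The comma-separated-column approach described above can be sketched as follows. The poster mentions PHP's implode()/explode(); Python's join()/split() are the analogue here, and all field names are illustrative:

```python
# One flat row holding all phone data as parallel comma-separated strings.
row = {"name": "John Smith",
       "numbers": "212-555-1212,212-555-1234",
       "types": "home,cell"}
# "explode": split the strings back into parallel lists to read them
numbers = row["numbers"].split(",")
types = row["types"].split(",")
pairs = list(zip(numbers, types))
# "implode": adding a phone means rebuilding BOTH strings in lockstep,
# or the numbers and their types fall out of order
numbers.append("212-555-9898")
types.append("office")
row["numbers"] = ",".join(numbers)
row["types"] = ",".join(types)
```

This keeps the row count down, but the database can no longer search, join, or constrain individual numbers - all of that logic moves into application code, which is the downside the poster notes.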
dmmh, agreed that there is redundant data, but this is just conceptual...
|Why multiple entries for a user's id? In my opinion, using an id or a name isnt much of a difference, there are still duplicates. |
The name is user space data. Names belong to humans and names are entered by humans. Humans make typos and marry or change their mind about their name. The id, OTOH, is usually solution space data which is automatically generated on insert. Automatically generated ids are guaranteed to be unique. Names are not. Thus using a name as an id is a bad idea because two different humans can have the same name but they should have different ids.
Regarding the "normalization increases code complexity" argument: It does. But it also creates more compact code. If you keep redundant data in several tables, you need to update several tables when the data changes. With normalization only one table needs to be updated.
The only complexity I see is in 'joins' and it should not go beyond the bounds of the database. The db model should be carved out regardless of the programming languages that may access the database.
|Easy_Coder: The only complexity I see is in 'joins' and it should not go beyond the bounds of the database. |
That's why MySQL has not graduated into the real world yet - although I realize many on this site use it successfully.
But for a robust enterprise application you want your DBA to write stored procedures to handle complex data manipulations. That will make your DB independent of the middle tier or front-end programming language.
|The name is user space data. Names belong to humans and names are entered by humans. Humans make typos and marry or change their mind about their name. The id, OTOH, is usually solution space data which is automatically generated on insert. Automatically generated ids are guaranteed to be unique. Names are not. Thus using a name as an id is a bad idea because two different humans can have the same name but they should have different ids. |
Of course - I realise that, and I don't use anything else myself. But I guess what I mean is: how far is one willing to go with normalisation? The method I mentioned is by far cleaner from a DB admin's point of view, as it takes fewer rows. From a developer's point of view, it complicates queries, especially the JOIN type, by an enormous amount.
tbh, for my site I use both. I guess it depends on the application being developed for it.
Nice article on normalisation.
However, something to bear in mind is that if you're designing a database just for web data (e.g. content management), you usually want as flat a database as you can get. If you've got a lot of joins in your stored procedures / SQL queries, it will take more processing power for the server to return the data, thus reducing performance.
Normalisation certainly has its place, especially for high-maintenance databases, and is very popular in internal company systems.
There's a big industry in database denormalisation at the moment. A lot of database developers have been trying their hand at web development, and unless they take the time to appreciate the difference between web databases and internal databases, the result can be performance degradation.
It all comes down to the individual application. As txbakers points out, if your only other alternative is to create multiple SQL statements / stored procedures, then normalisation is the lesser of two evils.
If you're writing a high traffic web site that intensively uses a database, think twice before excessively normalising your database.
|But I guess what I mean is, how far is one willing to go with normalisation? |
True. It occurred to me that normalization in itself is not an ideal. A fully normalized database is one with no NULL values and no duplicated values: it's made up of two-column tables where the first column is the id and the second is a value. That would be one extreme. The other extreme is to put everything into one table.
For example, if the unnormalized db is
I ¦ A ¦ B ¦ C
1 ¦ r ¦ t ¦ u
1 ¦ r ¦ t ¦ v
2 ¦ e ¦ f ¦ -
3 ¦ k ¦ l ¦ m
with I being the id and a dash symbolizing the NULL value, the fully normalized version would be
I ¦ A
1 ¦ r
2 ¦ e
3 ¦ k
I ¦ B
1 ¦ t
2 ¦ f
3 ¦ l
I ¦ C
1 ¦ u
1 ¦ v
3 ¦ m
The ideal lies somewhere between those two extremes. Because Column A and B have a cardinality of one (no more than one value per id) and no NULL values, it is not necessary to factor them out into separate tables. Column C has more than one value per id and NULL values, so it makes sense to factor it out:
I ¦ A ¦ B
1 ¦ r ¦ t
1 ¦ r ¦ t
2 ¦ e ¦ f
3 ¦ k ¦ l
I ¦ C
1 ¦ u
1 ¦ v
3 ¦ m
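The decomposition above can be sketched with plain Python structures (None plays the role of the dash, i.e. NULL, in the original table):

```python
# The unnormalised rows: (I, A, B, C), one tuple per row.
unnormalised = [
    (1, "r", "t", "u"),
    (1, "r", "t", "v"),
    (2, "e", "f", None),
    (3, "k", "l", "m"),
]
# Columns A and B: exactly one value per id and no NULLs, so they can
# stay together, keyed by id (duplicate rows collapse into one entry).
table_ab = {i: (a, b) for (i, a, b, _) in unnormalised}
# Column C: multiple values per id plus NULLs, so it is factored out
# into its own (id, value) table that contains no NULL rows at all.
table_c = [(i, c) for (i, _, _, c) in unnormalised if c is not None]
```

The duplicate (1, r, t) rows and the NULL both disappear: the one-to-many and optional data now lives only in the factored-out table.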
|If you're writing a high traffic web site that intensively uses a database, think twice before excessively normalising your database. |
This statement can probably be extended to any application where the number of SELECTs by far outweighs the number of INSERTs, UPDATEs and DELETEs. The more SELECTs you have, the more your app will suffer from the extra IO penalty that's incurred by a join (yes, the penalty lies in IO, not CPU).
I still think that for any large scale application - web or not, normalization is important and necessary. You shouldn't let your db design be polluted by performance considerations. That's where object-relational mapping (ORM) tools kick in. With ORMs you design an object model, not a DB model. The ORM then creates the corresponding DB model automatically. Furthermore, the ORM should let you tweak the trade-off between the aforementioned two extremes, i.e. unnormalized and fully normalized backend. Maybe it's even able to determine the optimal level of normalization on its own.
The usual end point for normalisation is when you reach the third normal form.
That means all the elements in a row depend only on the key columns.
3NF is sometimes described as meaning the data depends upon "the key, the whole key and nothing but the key".
There are "higher" normal forms but they rarely matter outside of an undergraduate computer science course.
And, as several people have already noted, it is completely normal even among database professionals to back away from the 3NF for performance and tactical reasons.
Normalisation is not the only guiding principle in knocking together data structures that will stand the test of time. For example, there is nothing in normalisation that'll tell me how I should split my database across separate data centers for performance.
Similarly, this post could be held in a database keyed by a unique message number. Or the key could be userid + user message number. Normalisation won't settle which is best. But it will help sift out how the other tables should look once that decision has been made.
(But that decision will most probably be reached by looking at performance. Having a single process that issues unique message numbers may be a bottleneck on a high-message-rate board).
Great post, txbakers.
Just wanted to add something that may not be obvious to many: the key to determining what level of normalization or denormalization to use is to look at the queries. In most applications you'll find that the majority of the queries are more or less the same... especially anything that is called per page view.
Sometimes it's best to denormalize a bit to speed up common queries. And sometimes it can make more sense to run more than one query in a row rather than a single join.
|That's why MySQL has not graduated into a real world yet |
I'm having a wee smile, I didn't know what this meant until recently.
Thank You, txbakers, for the great explanation!
I've worked with DBs for a long time, and normalisation was normal for me in an application environment. When I moved them over to a web environment, it became easier (and faster) to make them flatter (CMS-style, as someone already said).
I hate myself - it goes against the grain - but if I were to transfer my "normalised" DBs, would there be too many stored procedures (SPs) and joins for MySQL to deal with without affecting speed? It was easier with Access (queries), but apparently it can't stand the same degree of connection load.
|Sometimes it's best to denormalize a bit to speed up common queries |
Why is that best? Much better to add a stick of RAM to the database and let it do its job.