Forum Moderators: open
Normal and Abnormal are also terms used to describe the creation of a relational database. A database is said to be normal when all traces of redundant data are eliminated.
Putting the R back in RDMS
The R stands for Relational, as in relative and relates. How many times are we introduced as "Bill's Son" or "Sally's Husband" or "Arnold's Grandson?" Those references to YOU are using a relational model.
Your name might be John, but at school your dad is well known - you are Bill's son. Someone made a relationship between you and your dad.
In a database you might have a table of names:
John Smith
Fred Jones
Carla Bloop
But you also want to store information about those people, their relatives, their addresses hair color, etc. Some data will be unique to that name, but some might not be. Let's take a basic example - names and phone numbers.
NameTable
name, phoneno, type
John Smith, 212-555-1212, home
John Smith, 212-555-1234, cell
John Smith, 212-555-9898, office
Fred Jones, 718-555-1212, home
Fred Jones, 718-222-2222, fax
Fred Jones, 718-111-1111, cell
Fred Jones, 718-333-3333, pager
In that simple example above, you can see that there is redundant data - the name of the person. It would be better to split this into two tables using some type of relation between the tables to identify the data. I'm going to use an ID number for each name so when I have to find the match later, I can do it by numbers rather than letters. Databases prefer numbers.
Names
id, name
101, John Smith
102, Fred Jones
Phones
id, number, type
101, 212-555-1212, home
101, 212-555-1234, cell
101, 212-555-9898, office
102, 718-555-1212, home
102, 718-222-2222, fax
102, 718-111-1111, cell
102, 718-333-3333, pager
Better, but we still have a little redundant data here - the phone type. So, we could add a third table with the type codes:
Phones
id, number, type
101, 212-555-1212, h
101, 212-555-1234, c
101, 212-555-9898, o
102, 718-555-1212, h
102, 718-222-2222, f
102, 718-111-1111, c
102, 718-333-3333, p
Types
typecode, type
h, home
c, cell
o, office
f, fax
Anything else that can be simplified? Or normalized? Perhaps. But I'd be happy with this level of normalization. I can add new types easily. When I add a new person I'm not creating empty space in the first table for the phones and types.
Normalization goes way beyond this simple example. Database Administrators get into fist fights over the level of normalization required. But it is valuable when planning any database project. Planning to be normal from the start will save lots of grief in the future (I speak from personal experience....)
No, really, I mean that.
I'm a db designer by trade in my day job, and normalisation is second nature.
So you'd think jumping into dynamic sites with a db backend would be second nature for me too, right? Wrong. Because, and only because, the many examples and tutorials I've read have pretty much ignored normalisation in favor of a 'flat file' approach* - to the extent that I've been more or less convinced that a well-normalised database must present some problem in terms of php or the way SQL is implemented online or something.
So you're saying it shouldn't be a problem?
I can design and normalise my databases much as I would for an offline app?
* ok, obviously that could be a matter of the writer thinking 'let's not confuse the newbies too much', but in my case it did just the opposite.
Planning to be normal from the start will save lots of grief in the future (I speak from personal experience....)
I'll vouch for that absolutely!
So you're saying it shouldn't be a problem?I can design and normalise my databases much as I would for an offline app?
Absolutely! In fact it would help most web apps to be normalized because it can reduce the number of queries to the database (the slowest part of web apps).
The only downside is that the queries become more complex, involving inner and outer joins (more on that later).
Which is probably why most tutorials take a flat file approach. It's easier to build a simple query than to try to explain joins.
So, IMO, small databases don't need to be seriously normalised.
Big databases are an exception too - denormalisation is an acceptable technique designed to improve query performance -- all those fancy joins don't come free you know!
Complexity? That depends how its managed, in a good design the first tier you have the underlying normalised tables. I would always go to 3rd normal form (no repeating groups, no redundancy), avoid allowing nulls and use inner joins only. I would need very good reasons not do this for any real/live system.
The next tier is the logical views (Or queries that you can query in Access!)that do look more like flat files or business entiies. These make queries dead easy as there no joins when selecting from them.
I might go back to 2nd normal form (which is different from denormalising!) and shove the data into a cube for analysis but for transaction systems no,no.
My 2c
Type is an attribute of phone and does not need a separate table, as it has no attributes of its own other than the type value (home, office, cell). You are forced to invent another attribute (h, o, c, etc) just to justify the existance of the type table. This is known as an over normalised set of tables - normailised more than is necessary to remove reundancy. You could just as easily have used h, o, c in PHONES without the TYPES table existing!
The example would have been better if you'd included address details for each name, to show how much redundant data was being repeated before the names were separated from the phones.
Phones
id, number, type
101, 212-555-1212,home
102, 718-555-1212,home
In the above the data "home" is stored twice in the database & that is redundancy.
The tables below solve this problem.
Phones
id, number, type
101, 212-555-1212,1
102, 718-555-1212,1
PhoneTypes
1,Home
Now if you wish to rename "Home" to "Home Telephone Number" you only have to do it in one place - the phone types table and not every record in Phones with a Home Phone.
One thing that I would add to the conversation is that highly normalized datamodels can add to the complexity of your queries in the form of joins. Some of the larger systems that I've worked on actually required processes to flatten out the data into read-only tables that we're built daily in order to avoid writing sprocs that made 5 or more joins on large tables.
i have found using normalised tables for even small apps can greatly ease administration tasks - i can only recommend it. plus for all but the most dynamic websites / forums, sql queries can be cached, meaning joins needn't be any sort of bottleneck.
Now if you wish to rename "Home" to "Home Telephone Number" you only have to do it in one place - the phone types table and not every record in Phones with a Home Phone.To add to that - the addition of the phone types table naturally provides domain constraint, ensuring we don't get phone types like Mom's Boyfriend mixed in with our phone numbers. Not too likely if you have only one page that can update the phone numbers, but if the database becomes shared accross multiple websites, for example, things can start to fall into dissarray without constraints.
In a live DB it is always a trade-off between ease of updating and speed/ease of data extraction.
phone types doesn't demonstrate that point because there are only a few. But imagine having a 500 types that needed unique keys:
1
2
3..
499
500
scales better then
a
b...
aa
bb
downside: more PHP code to be written and way moe complicated queries
it all comes down to preference I guess, when you design both the DB AND develop the code AND design the site, like I do
Why multiple entries for a user's id? In my opinion, using an id or a name isnt much of a difference, there are still duplicates.
The name is user space data. Names belong to humans and names are entered by humans. Humans make typos and marry or change their mind about their name. The id, OTOH, is usually solution space data which is automatically generated on insert. Automatically generated ids are guaranteed to be unique. Names are not. Thus using a name as an id is a bad idea because two different humans can have the same name but they should have different ids.
Regarding the "normalization increases code complexity" argument: It does. But it also creates more compact code. If you keep redundant data in several tables, you need to update several tables when the data changes. With normalization only one table needs to be updated.
Easy_Coder: The only complexity I see is in 'joins' and it should not go beyond the bounds of the database.
That's why MySQL has not graduated into a real world yet - although I realize many on this site use it successfuly.
But for a robust enterprise application you want your DBA to write stored procedures to handle complex data manipulations. That will make your DB independent of the middle tier or front-end programming language
The name is user space data. Names belong to humans and names are entered by humans. Humans make typos and marry or change their mind about their name. The id, OTOH, is usually solution space data which is automatically generated on insert. Automatically generated ids are guaranteed to be unique. Names are not. Thus using a name as an id is a bad idea because two different humans can have the same name but they should have different ids.
ofcourse, I realise that, I dont use anything else myself. But I guess what I mean is, how far is one willing to go with normalisation? The method I mentioned is by far cleaner from a DB admin's point of view, as it takes less rows. From a developers point of view, it complicates queries, especially the JOIN type queries, by an enormous amount.
tbh, for my site I use both. I guess it depends on the application being developed for it.
However something to bear in mind is that if you're designing a database just for web data (eg. content management) usually you want as flat a database as you can get. If you've got a lot of joins in your stored procedures/ SQL queries, it will take more processing power for the server to return the data, thus reducing performance.
Normalisation certainly has it's place, especially for a high maintanence databases and is very popular in internal company systems.
There's a big industry in database denormalisation at the moment. A lot of database developers have been trying their hand at web development and unless they take the time to appreciate the difference between web databases and internal databases and this gan result in performance degeredation.
It all comes down to the individual application. As txbakers points out, if your only other alternative is to create multiple SQL statements / stored procedures, then normalisation is the better of two evils.
If you're writing a high traffic web site that intensively uses a database, think twice before excessively normalising your database.
But I guess what I mean is, how far is one willing to go with normalisation?
True. It occurred to me that normalization in itself is not an ideal. A fully normalized database is the one where you have no NULL values and no duplicated values. It's made up of two-column tables where the first column is the id and the second column is a value. That would be one extreme. The other extreme is put everything into one table.
For example, if the unnormalized db is
I ¦ A ¦ B ¦ C
--+---¦---¦--
1 ¦ r ¦ t ¦ u
1 ¦ r ¦ t ¦ v
2 ¦ e ¦ f ¦ -
3 ¦ k ¦ l ¦ m
I ¦ A
--+---
1 ¦ r
2 ¦ e
3 ¦ k
I ¦ B
--+---
1 ¦ t
2 ¦ f
3 ¦ l
I ¦ C
--+--
1 ¦ u
1 ¦ v
3 ¦ m
I ¦ A ¦ B
--+---¦---
1 ¦ r ¦ t
1 ¦ r ¦ t
2 ¦ e ¦ f
3 ¦ k ¦ l
I ¦ C
--+--
1 ¦ u
1 ¦ v
3 ¦ m
If you're writing a high traffic web site that intensively uses a database, think twice before excessively normalising your database.
This statement can probably be extended to any application where the number of SELECTs by far outweighs the number of INSERTs, UPDATEs and DELETEs. The more SELECTs you have, the more your app will suffer from the extra IO penalty that's incurred by a join (yes, the penalty lies in IO, not CPU).
I still think that for any large scale application - web or not, normalization is important and necessary. You shouldn't let your db design be polluted by performance considerations. That's where object-relational mapping (ORM) tools kick in. With ORMs you design an object model, not a DB model. The ORM then creates the corresponding DB model automatically. Furthermore, the ORM should let you tweak the trade-off between the aforementioned two extremes, i.e. unnormalized and fully normalized backend. Maybe it's even able to determine the optimal level of normalization on its own.
That means all the elements in a row depend only on the key columns.
3NF is sometimes described as meaning the data depends upon "the key, the whole key and nothing but the key".
There are "higher" normal forms but they rarely matter outside of an undergraduate computer science course.
And, as several people have already noted, it is completely normal even among database professionals to back away from the 3NF for performance and tactical reasons.
Normalisation is not the only guiding principle in knocking together data structures that will stand the test of time. For example, there is nothing in normalisation that'll tell me how I should split my database across separate data centers for performance.
Similarly, this post could be held in a database keyed by a unique WM[b][/W] message number. Or the key could be userid+user message number. Normalisation won't settle which is best. But, it will help sift out how other tables should look once that decision has been made.
(But that decision will most probably be reached by looking at performance. Having a single process that issues unique message numbers may be a bottleneck on a high-message-rate board).
Just wanted to add something that may not be obvious to many: The key to determining what level of normalization or denormalization to use is by looking at the queries. In most applications you'll find that the majority of the queries are more or less the same... especially when it's something that called per page view.
Sometimes it's best to denormalize a bit to speed up common queries. And sometimes it can make more sense to run more than one query in a row rather than a single join.
That's why MySQL has not graduated into a real world yet
I'm having a wee smile, I didn't know what this meant until recently.
Thank You, txbakers, for the great explanation!
I've worked with DB's for a long time and normalisation was normal for me in an application environment, until I moved them over to web environment, then it became easier (and faster) to make them flatter (CMS style as someone already said)
I hate myself, it's going against all the grains, but if I were to transer my "normalised" DB's would it then have way too many Stored Procedures (SP's), joins for MySQL to deal with without affecting speed? It was easier with ACCESS (Queries), but apparently it can't stand the same degree of connection loads..
Suzy