
Normalization - A Brief Introduction

databases, unlike web geeks, need to be normal

3:26 am on Sep 12, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member txbakers is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Sept 1, 2001
votes: 0

In the movie Young Frankenstein, Igor is charged with getting a brain for the monster. He drops the brain he was after and instead brings back another one. After the wrong brain was "installed" and the mistake discovered, the Doctor asked him whose brain he had brought back, to which Igor replied, "Abby someone. Yes, Abby Normal."

Normal and Abnormal are also terms used to describe the design of a relational database. A database is said to be normalized when all traces of redundant data have been eliminated.

Putting the R back in RDBMS

The R stands for Relational, as in relative and relates. How many times are we introduced as "Bill's Son" or "Sally's Husband" or "Arnold's Grandson?" Those references to YOU are using a relational model.

Your name might be John, but at school your dad is well known - you are Bill's son. Someone made a relationship between you and your dad.

In a database you might have a table of names:

John Smith
Fred Jones
Carla Bloop

But you also want to store information about those people: their relatives, their addresses, their hair color, and so on. Some data will be unique to that name, but some might not be. Let's take a basic example - names and phone numbers.

name, phoneno, type
John Smith, 212-555-1212, home
John Smith, 212-555-1234, cell
John Smith, 212-555-9898, office
Fred Jones, 718-555-1212, home
Fred Jones, 718-222-2222, fax
Fred Jones, 718-111-1111, cell
Fred Jones, 718-333-3333, pager

In that simple example above, you can see that there is redundant data - the name of the person. It would be better to split this into two tables, using some kind of relation between them to link the data back together. I'm going to use an ID number for each name so that when I have to find the match later, I can do it by numbers rather than letters. Databases prefer numbers.

id, name
101, John Smith
102, Fred Jones

id, number, type
101, 212-555-1212, home
101, 212-555-1234, cell
101, 212-555-9898, office
102, 718-555-1212, home
102, 718-222-2222, fax
102, 718-111-1111, cell
102, 718-333-3333, pager

Better, but we still have a little redundant data here - the phone type. So, we could add a third table with the type codes:

id, number, type
101, 212-555-1212, h
101, 212-555-1234, c
101, 212-555-9898, o
102, 718-555-1212, h
102, 718-222-2222, f
102, 718-111-1111, c
102, 718-333-3333, p

typecode, type
h, home
c, cell
o, office
f, fax
p, pager

Anything else that can be simplified? Or normalized? Perhaps. But I'd be happy with this level of normalization. I can add new types easily. When I add a new person I'm not creating empty space in the first table for the phones and types.
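As a concrete sketch of the finished design, here are the three tables and the join that reassembles the original flat list. This uses SQLite via Python purely so the example is self-contained and runnable; the table and column names are my own choice, not anything prescribed above.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# One table per kind of thing: people, phone-type codes, and phone numbers.
cur.executescript("""
CREATE TABLE people (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE types  (typecode TEXT PRIMARY KEY, type TEXT);
CREATE TABLE phones (person_id INTEGER REFERENCES people(id),
                     number    TEXT,
                     typecode  TEXT REFERENCES types(typecode));

INSERT INTO people VALUES (101, 'John Smith'), (102, 'Fred Jones');
INSERT INTO types  VALUES ('h', 'home'), ('c', 'cell'), ('o', 'office'),
                          ('f', 'fax'), ('p', 'pager');
INSERT INTO phones VALUES
    (101, '212-555-1212', 'h'), (101, '212-555-1234', 'c'),
    (101, '212-555-9898', 'o'), (102, '718-555-1212', 'h'),
    (102, '718-222-2222', 'f'), (102, '718-111-1111', 'c'),
    (102, '718-333-3333', 'p');
""")

# A join reassembles the original flat "name, number, type" list on demand.
rows = cur.execute("""
    SELECT p.name, ph.number, t.type
    FROM phones ph
    JOIN people p ON p.id = ph.person_id
    JOIN types  t ON t.typecode = ph.typecode
    ORDER BY p.id, ph.number
""").fetchall()
for name, number, kind in rows:
    print(name, number, kind)
```

Each name and each type string is now stored exactly once; adding a new phone type is a single row in `types`, and adding a person never creates empty phone slots.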

Normalization goes way beyond this simple example. Database Administrators get into fist fights over the level of normalization required. But it is valuable when planning any database project. Planning to be normal from the start will save lots of grief in the future (I speak from personal experience....)

12:01 pm on Sept 24, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:Oct 15, 2002
votes: 0

>> Why is that best?

It's not always the right the thing to do but here's a great example, real site:

It's not always the right thing to do, but here's a great example, from a real site:

Normalized DB, indexed properly, lots of records. When the site got traffic it choked. The problem revolved around three types of queries: the first would just run forever (we had to kill them manually), the second would take a full 5 minutes to run, the third about a full minute. The site was effectively dead.

After restructuring the db, ALL the queries completed within milliseconds. Might have been an extreme case, but regardless that's a pretty dramatic difference.

This was MySQL 3.x, and the WHERE clauses were searching across multiple tables (i.e. WHERE table1.field >= '5' AND table2.field = 'x' AND table3.field = 'y', etc.).
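To make "restructuring the db" concrete, here is a minimal sketch of that kind of denormalization, using SQLite via Python; the table and column names are invented, since the post doesn't show the real schema. Joining once into a flat table turns the three-table predicate into a single-table WHERE that one index can serve.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()
cur.executescript("""
CREATE TABLE table1 (id INTEGER PRIMARY KEY, field INTEGER);
CREATE TABLE table2 (id INTEGER, field TEXT);
CREATE TABLE table3 (id INTEGER, field TEXT);
INSERT INTO table1 VALUES (1, 7), (2, 3);
INSERT INTO table2 VALUES (1, 'x'), (2, 'x');
INSERT INTO table3 VALUES (1, 'y'), (2, 'y');

-- Denormalized copy: pay for the join once, filter many times.
CREATE TABLE flat AS
SELECT t1.id, t1.field AS f1, t2.field AS f2, t3.field AS f3
FROM table1 t1
JOIN table2 t2 ON t2.id = t1.id
JOIN table3 t3 ON t3.id = t1.id;
""")

# The three-table predicate from the post, now a single-table WHERE.
rows = cur.execute(
    "SELECT id FROM flat WHERE f1 >= 5 AND f2 = 'x' AND f3 = 'y'"
).fetchall()
```

The trade-off is the usual one: the flat copy must be rebuilt or kept in sync whenever the source tables change.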

9:56 pm on Sept 25, 2005 (gmt 0)

Senior Member from MY 

WebmasterWorld Senior Member vincevincevince is a WebmasterWorld Top Contributor of All Time 10+ Year Member

joined:Apr 1, 2003
votes: 0

this is a lovely thread, now how about a thread about optimising joins between multiple tables :D

7:10 am on Sept 26, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 8, 2003
votes: 0

I have found the MySQL optimizer regardless of version (3,4 or 5) to be - well, suboptimal. With a fully normalized database, a query involving the longest table (several million records) and two other joined tables would take several hours to complete. Just forcing the optimizer into a different query plan would speed up that query to a couple of seconds.

This just shows that denormalization isn't always necessary. There are other options: optimizing queries, helping the optimizer, or even buying a RAID controller with fast SCSI discs (more RAM won't help, because the bottleneck is I/O, not memory or CPU). Those options are preferable in my opinion, as long as I can keep my code and data clean.

2:49 pm on Sept 26, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 13, 2002
votes: 0

Normalised or not - you are in trouble after a million or so records. Big tables need compiled views to reduce the scan time.

A query to display one invoice to a customer, selecting from 3 yrs of data (100K sales PA, 1.5 million rows in the invoice-line table after 3 years), optimised by a SQL2K compiled view:

Create View Sales2005Q1 AS
SELECT * FROM dbo.Invoices
WHERE dbo.SalesQtr(dbo.Invoices.InvoiceDate) = '2005Q1'

SELECT * FROM Sales2005Q1
Scan time is reduced to 1/12th

6:42 am on Sept 28, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Dec 8, 2003
votes: 0

And as the thread goes slightly off topic, we should always keep in mind whether we're talking about MySQL, SQL Server, Oracle or whatnot. Sometimes that makes a huge difference.


I think your query is slow because the function dbo.SalesQtr() is invoked for every row. You might be able to speed it up, without having to use a view, by doing date comparisons instead:

SELECT * FROM dbo.Invoices
WHERE dbo.SalesQtrStartDate('2005Q1') <= dbo.Invoices.InvoiceDate AND dbo.Invoices.InvoiceDate < dbo.SalesQtrStartDate('2005Q2')

You only need to write the SalesQtrStartDate() function, which returns the date of the first day in the given quarter.
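A sketch of that helper in Python (the 'YYYYQn' label format is my assumption; the post doesn't define it). The payoff of the rewrite is that the range test compares the InvoiceDate column directly against two constants, so the database can use an index on InvoiceDate instead of calling a function for every row.

```python
from datetime import date

def sales_qtr_start_date(qtr: str) -> date:
    """Return the first day of a quarter labelled 'YYYYQn', e.g. '2005Q1'.

    Python sketch of the SalesQtrStartDate() helper suggested above;
    the label format is an assumption, not from the post.
    """
    year, n = int(qtr[:4]), int(qtr[5])
    return date(year, 3 * (n - 1) + 1, 1)

print(sales_qtr_start_date("2005Q1"))  # 2005-01-01
print(sales_qtr_start_date("2005Q2"))  # 2005-04-01
```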

12:33 pm on Sept 29, 2005 (gmt 0)

Senior Member

WebmasterWorld Senior Member 10+ Year Member

joined:June 13, 2002
votes: 0

The query is fast (not slow :) because it is compiled & cached and selects from less data.

1:25 pm on Sept 29, 2005 (gmt 0)

Preferred Member

10+ Year Member

joined:Mar 1, 2004
votes: 3

Great thread.

It would be interesting to hear more thoughts about various strategies that can be used to overcome the performance problems in situations where it is desirable to maintain a high degree of normalization in the underlying DB.

We're still struggling with this problem. We've been experimenting with two techniques:

1. exporting a special denormalized DB to the web server which is used to run the actual website (we call it a data cube)

2. using a content management system's "static export" function to generate flat html pages.

One or the other of these techniques -- or a combination of both -- seems imperative (or perhaps there is a better strategy we haven't thought of?).

We are working with a large database that really needs a high degree of normalization for efficiency and maintenance reasons, and yet the web pages we want to publish were impossibly slow when we tried to publish them directly from the DB. They required too many joins and/or stored procedures, and this couldn't be handled by the MySQL software we are using.
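For what it's worth, technique 1 can be sketched in a few lines: run the expensive joins against the normalized master DB once, and write the flat result into a separate database that the web server queries directly. Everything here (schema, names, SQLite via Python) is invented for illustration; the real "data cube" export would be driven by whatever joins the pages actually need.

```python
import sqlite3

# Stand-in for the normalized master database.
master = sqlite3.connect(":memory:")
master.executescript("""
CREATE TABLE authors  (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE articles (id INTEGER PRIMARY KEY, title TEXT,
                       author_id INTEGER REFERENCES authors(id));
INSERT INTO authors  VALUES (1, 'Carla Bloop');
INSERT INTO articles VALUES (10, 'Normalization 101', 1);
""")

# The denormalized "cube": in practice, a file shipped to the web server.
cube = sqlite3.connect(":memory:")
cube.execute("""CREATE TABLE page_data
                (article_id INTEGER PRIMARY KEY, title TEXT, author TEXT)""")

# Export step: run the expensive join once, store the flat rows.
flat = master.execute("""
    SELECT a.id, a.title, au.name
    FROM articles a JOIN authors au ON au.id = a.author_id
""").fetchall()
cube.executemany("INSERT INTO page_data VALUES (?, ?, ?)", flat)
cube.commit()

# The website then renders each page with a trivial single-table lookup.
page = cube.execute(
    "SELECT title, author FROM page_data WHERE article_id = 10"
).fetchone()
```

The master stays fully normalized for maintenance; only the read-only export is denormalized, so staleness is bounded by how often the export runs.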
