homepage Welcome to WebmasterWorld Guest from 54.145.243.51
register, free tools, login, search, pro membership, help, library, announcements, recent posts, open posts,
Become a Pro Member
Home / Forums Index / Code, Content, and Presentation / Databases
Forum Library, Charter, Moderator: open

Databases Forum

This 37 message thread spans 2 pages: < < 37 ( 1 [2]     
Normalization - A Brief Introduction
databases, unlike web geeks, need to be normal
txbakers

WebmasterWorld Senior Member txbakers us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 26 posted 3:26 am on Sep 12, 2005 (gmt 0)

In the movie Young Frankenstein Igor is charged with getting a brain for the monster. He drops the brain he was after and instead brings back another one. After the wrong brain was "installed" and it was discovered, the Doctor asked him what brain he brought back, to which Igor replied, "Abby someone. Yes, Abby Normal."

Normal and Abnormal are also terms used to describe the creation of a relational database. A database is said to be normal when all traces of redundant data are eliminated.

Putting the R back in RDMS

The R stands for Relational, as in relative and relates. How many times are we introduced as "Bill's Son" or "Sally's Husband" or "Arnold's Grandson?" Those references to YOU are using a relational model.

Your name might be John, but at school your dad is well known - you are Bill's son. Someone made a relationship between you and your dad.

In a database you might have a table of names:

John Smith
Fred Jones
Carla Bloop

But you also want to store information about those people, their relatives, their addresses hair color, etc. Some data will be unique to that name, but some might not be. Let's take a basic example - names and phone numbers.

NameTable
name, phoneno, type
John Smith, 212-555-1212, home
John Smith, 212-555-1234, cell
John Smith, 212-555-9898, office
Fred Jones, 718-555-1212, home
Fred Jones, 718-222-2222, fax
Fred Jones, 718-111-1111, cell
Fred Jones, 718-333-3333, pager

In that simple example above, you can see that there is redundant data - the name of the person. It would be better to split this into two tables using some type of relation between the tables to identify the data. I'm going to use an ID number for each name so when I have to find the match later, I can do it by numbers rather than letters. Databases prefer numbers.

Names
id, name
101, John Smith
102, Fred Jones

Phones
id, number, type
101, 212-555-1212, home
101, 212-555-1234, cell
101, 212-555-9898, office
102, 718-555-1212, home
102, 718-222-2222, fax
102, 718-111-1111, cell
102, 718-333-3333, pager

Better, but we still have a little redundant data here - the phone type. So, we could add a third table with the type codes:

Phones
id, number, type
101, 212-555-1212, h
101, 212-555-1234, c
101, 212-555-9898, o
102, 718-555-1212, h
102, 718-222-2222, f
102, 718-111-1111, c
102, 718-333-3333, p

Types
typecode, type
h, home
c, cell
o, office
f, fax

Anything else that can be simplified? Or normalized? Perhaps. But I'd be happy with this level of normalization. I can add new types easily. When I add a new person I'm not creating empty space in the first table for the phones and types.

Normalization goes way beyond this simple example. Database Administrators get into fist fights over the level of normalization required. But it is valuable when planning any database project. Planning to be normal from the start will save lots of grief in the future (I speak from personal experience....)

 

jamesa

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 26 posted 12:01 pm on Sep 24, 2005 (gmt 0)

>> Why is that best?

It's not always the right the thing to do but here's a great example, real site:

Normalized DB, indexed properly, lots of records. When the site got traffic it choked. The problem revolved around three types of queries: first would just run forever (had to kill them manually), the second type would take a full 5 minutes to run, the third about a full minute. Site was effectively dead.

After restructuring the db, ALL the queries completed within milliseconds. Might have been an extreme case, but regardless that's a pretty dramatic difference.

This was MySQL 3x, and the WHERE clauses were searching across multiple tables (i.e.- WHERE table1.field >='5' AND table2.field ='x' AND table3.field ='y', etc...

vincevincevince

WebmasterWorld Senior Member vincevincevince us a WebmasterWorld Top Contributor of All Time 10+ Year Member



 
Msg#: 26 posted 9:56 pm on Sep 25, 2005 (gmt 0)

this is a lovely thread, now how about a thread about optimising joins between multiple tables :D

Hanu

10+ Year Member



 
Msg#: 26 posted 7:10 am on Sep 26, 2005 (gmt 0)

I have found the MySQL optimizer regardless of version (3,4 or 5) to be - well, suboptimal. With a fully normalized database, a query involving the longest table (several million records) and two other joined tables would take several hours to complete. Just forcing the optimizer into a different query plan would speed up that query to a couple of seconds.

This just shows that denormalization isn't always necessary. There are other options: optimizing queries, helping the optimzers or even buying a RAID controller (RAM won't help because the bottleneck is IO, not memory or CPU) and fast SCSI discs are preferrable in my opinion as long as I can keep my code and data clean.

aspdaddy

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 26 posted 2:49 pm on Sep 26, 2005 (gmt 0)

Normalised or not - you are in trouble after a million or so records. Big tables need complied views to reduce the scan time.

A query to display one invoice to a customer selecting from 3 yrs data (100K sales PA, 1.5 million rows in invoice-line table after 3 years) optimised by a SQL2K compiled View:

Create View Sales2005Q1
AS
SELECT * FROM dbo.Invoices
WHERE dbo.SalesQtr(dbo.Invoices.InvoiceDate)='2005Q1'
GO

SELECT * FROM Sales2005Q1
Scan time is reduced to 1/12th

Hanu

10+ Year Member



 
Msg#: 26 posted 6:42 am on Sep 28, 2005 (gmt 0)

And as the thread goes slightly off topic, we should always keep in mind wether we're talking about MySQL, SQL Server, Oracle or whatnot. Sometimes that makes a huge difference.

aspdaddy,

I think your query is slow because for every row, the procedure dbo.SalesQtr() is invoked. You might be able to speed it up without having to use a view by doing date comparisons

SELECT * FROM dbo.Invoices
WHERE dbo.SalesQtrStartDate('2005Q1') <= dbo.Invoices.InvoiceDate AND dbo.Invoices.InvoiceDate < dbo.SalesQtrStartDate('2005Q2')

You only need to write the SalesQtrStartDate() procedure which returns the date of the first day in the given quarter.

aspdaddy

WebmasterWorld Senior Member 10+ Year Member



 
Msg#: 26 posted 12:33 pm on Sep 29, 2005 (gmt 0)

The query is fast (not slow:) because it is compiled & cached and selects from less data.

econman

10+ Year Member



 
Msg#: 26 posted 1:25 pm on Sep 29, 2005 (gmt 0)

Great thread.

It would be interesting to hear more thoughts about various strategies that can be used to overcome the performance problems in situations where it is desirable to maintain a high degree of normalization in the underlying DB.

We're still struggling with this problem. We've been experimenting with two techniques:

1. exporting a special denormalized DB to the web server which is used to run the actual website (we call it a data cube)

2. using a content management system's "static export" function to generate flat html pages.

One or the other -- or a combination of both -- of these techniques seems imperative (or perhaps there is a better strategy we haven't thought of?).

We are working with a large data base than really needs a high degree of normalizion for efficiency and maintenance reasons, and yet the web pages we want to publish were impossibly slow when we tried to publish them directly from the DB. They required too many joins, and/or stored procedures, and this couldn't be handled by the MYSQL software we are using.

This 37 message thread spans 2 pages: < < 37 ( 1 [2]
Global Options:
 top home search open messages active posts  
 

Home / Forums Index / Code, Content, and Presentation / Databases
rss feed

All trademarks and copyrights held by respective owners. Member comments are owned by the poster.
Home ¦ Free Tools ¦ Terms of Service ¦ Privacy Policy ¦ Report Problem ¦ About ¦ Library ¦ Newsletter
WebmasterWorld is a Developer Shed Community owned by Jim Boykin.
© Webmaster World 1996-2014 all rights reserved