|Database Organization for Speed|
50,000 rows with 200 columns vs 200,000 rows with 50 columns
| 4:03 am on Sep 2, 2011 (gmt 0)|
I'm not much of a SQL guy, so I thought I'd ask you guys here.
I've got a huge database I've been handed: 50,000 rows with 200 columns. Interestingly, after the first couple of columns, the rest are four sequential repeats of the same group, for example:
ProductID/Brand/Type/ followed by SmlHeight/SmlWidth/SmlWeight/MedHeight/MedWidth/MedWeight/LrgHeight/LrgWidth/LrgWeight/etc
Imagine that, but in 4 groups of 50 totaling 200. They're all just Height/Width/Weight broken down into 4 versions of each other.
Running speed tests on my 200-column table, it doesn't seem to matter whether there are 50,000 rows or 5,000 rows; the queries all run at about the same speed. However, if I reduce the columns from 200 to, say, 50, the queries speed up significantly.
So, is there any reason it's a bad idea to just chop those up into 4 slightly repeating rows? I.e., even though probably 10 columns (ProductID/Brand/Type/etc) would be exact duplicates, the rest would be unique to each row.
I guess my question really breaks down into this: which of these is the best option?
A) 50,000 rows with 200 columns
B) 200,000 rows with 50 slightly repeating columns
C) 4 separate tables with 50,000 rows and 50 columns
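To make option B concrete, here's a rough sketch of what one row-per-size table might look like (column names and types are my guesses based on the description above, not the actual schema):

```sql
-- Option B as a sketch: one row per product per size, so the four repeated
-- Height/Width/Weight groups collapse into a single set of columns plus a
-- SizeType column saying which size the row describes.
CREATE TABLE product_dims (
    ProductID INT          NOT NULL,
    Brand     VARCHAR(50),
    Type      VARCHAR(50),
    SizeType  VARCHAR(10)  NOT NULL,  -- e.g. 'Sml', 'Med', 'Lrg', ...
    Height    DECIMAL(8,2),
    Width     DECIMAL(8,2),
    Weight    DECIMAL(8,2)
    -- ... the rest of the ~50 per-size columns ...
);
```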
Simply chopping up this database into parts is probably pretty easy, anything more than that is certainly beyond my skill. This isn't a large company or a huge website with millions of visitors, so I really just need whatever is the smart and quick choice.
Any help would be appreciated. Thanks!
| 5:19 am on Sep 7, 2011 (gmt 0)|
Firstly, you should not have problems with a 50,000-row table unless you have VERY high performance requirements. Is there really a problem?
You also need to be careful not to lose important functionality. The current database probably has a unique index on the product ID column to prevent duplicates; if you split sizes into separate rows, you'd need a size-type field and a unique index across both columns. If all the dimensions are required for each product, that will be harder to enforce.
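The point about the unique index might look something like this (names are illustrative, assuming a `product_dims` table with a `SizeType` column):

```sql
-- With one row per product, a unique index on ProductID alone prevents
-- duplicates. Once sizes become rows, ProductID repeats legitimately,
-- so uniqueness must cover both columns together.
ALTER TABLE product_dims
    ADD UNIQUE INDEX ux_product_size (ProductID, SizeType);
```

Note that this only stops duplicate (product, size) pairs; it cannot force every product to have all four sizes present, which is the harder constraint mentioned above.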
Duplicating information is bad design, and should only be done when necessary ("denormalisation").
What you could try is five tables (a product table and four dimension tables, each with a foreign key to product), or two tables (a product table and a dimensions table, where each dimensions row has a foreign key to product and a size type of small/large/etc.), or two tables split by column, with a single row per product in each.
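The two-table variant with a size-type column could be sketched like this (hypothetical names and types; the composite primary key doubles as the unique index discussed above):

```sql
-- One row per product: the columns that never repeat.
CREATE TABLE product (
    ProductID INT PRIMARY KEY,
    Brand     VARCHAR(50),
    Type      VARCHAR(50)
);

-- One row per product per size: the repeating Height/Width/Weight group.
CREATE TABLE dimensions (
    ProductID INT         NOT NULL,
    SizeType  VARCHAR(10) NOT NULL,
    Height    DECIMAL(8,2),
    Width     DECIMAL(8,2),
    Weight    DECIMAL(8,2),
    PRIMARY KEY (ProductID, SizeType),
    FOREIGN KEY (ProductID) REFERENCES product (ProductID)
) ENGINE=InnoDB;  -- in MySQL, InnoDB is needed if you want the FK enforced
```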
The best option will depend on:
1) What RDBMS you use: for example, MySQL with MyISAM tables does not enforce foreign keys
2) Whether the database is read heavy or has a lot of writes: adding indexes makes reading faster and writing slower.
3) Whether you need to be able to query by all columns, or whether some are just looked up as necessary.
| 2:44 am on Sep 8, 2011 (gmt 0)|
This is good advice, I wasn't even thinking about the unique index problem of duplicating. Apologies for being such a noob when it comes to this stuff :) Thanks so much!
| 6:47 am on Sep 8, 2011 (gmt 0)|
Glad to help. Are you going to keep the database as it is, or are you changing it?
| 4:19 pm on Sep 8, 2011 (gmt 0)|
I've decided to split it into multiple tables. I figure the default page items (the first 50 columns in this scenario) will account for 95% of the queries, with the other 5% hitting the remaining 150 columns as people click through various options. So it seems to make the most sense to move those other columns out and make them accessible only when needed, linking them by the one key. I thought that would make generating the queries harder, but I've been looking into it and it's really not that difficult at all :) Thanks again for your help!
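That split might play out like this in queries (table and column names are made up for illustration, assuming a main table and an "extra columns" table sharing ProductID):

```sql
-- The 95% case: the default page reads only the main table.
SELECT ProductID, Brand, Type   -- ...plus the rest of the first-50 columns
FROM   product_main
WHERE  ProductID = 12345;

-- The 5% case: join in the extra columns only when a user clicks through.
SELECT m.ProductID, m.Brand,
       x.LrgHeight, x.LrgWidth, x.LrgWeight
FROM   product_main  AS m
JOIN   product_extra AS x ON x.ProductID = m.ProductID
WHERE  m.ProductID = 12345;
```

Since both tables keep one row per product, the join is a simple key-to-key lookup, which is why the extra query generation turns out not to be difficult.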