Forum Moderators: coopster
Background:
The site in question involves widgets (how original!). There is a bunch of information stored in the database about each widget (name, model #, region, bubbles-per-hour, category). On the site, users can browse the information by selecting one of several sorting schemes (alpha by name, alpha by region, etc). One such sorting scheme is a category based one. Users can click a category name and the page builds a list of links to the widgets in that category.
Implementation:
Here's what I'm currently working with. All information is stored in one table. For sorting by name, listing the contents is easy and fast: simply grab all the rows, alphabetize the name strings, and output the list. Same for alpha by region.
Sorting by category is somewhat more complicated. Here I have to pull all the rows, then loop through them, searching the 'categories' field in each entry for a match in the comma seperated string to the user selected category.
Since each widget can fall into one or more categories (If each widget had only one category, this would be easy. Just build a table for each category and output all the contents.), this ends up being a lot of searching. At a few hundred widgets, this is no big deal. But when numbers reach into the thousands, or, heaven forbid, the tens of thousands, searching every 'categories' field in the DB will quickly become unweildy.
A Solution (My Question)?
After reading about normalizing databases, I came up with the following 'plan'. What I don't know is whether this is really any better than the situation described above. I was hoping someone with more DB experience than myself might be able to tell me whether this will prove to be a more efficient way to organize and retrieve the data...
Remove the 'categories' column from the widget_info table and create a new table called widget_categories. Name the columns in this table according to the category names used on the page. For each widget entered into the DB, create an identical unique ID in each table. Stuff the widget info into widget_info, and set a boolean TRUE in any column of widget_categories that matches the categories of the widget.
When the user selects a category to sort widgets by, the script goes to the widget_categories table and searches for rows where that category is set to TRUE. It then takes all of the IDs that match the search, and pulls THOSE rows from widget_info to build the item list.
The Question:
Again, what I'm not sure of is whether or not this search (of widget_categories for TRUE in a given column of each row) will be faster and more efficient than pulling the string of category names from the column in widget_info and searching for a match. My suspicion is that it would be, but my experience with mySQL is limited and my knowledge of how databases run on a programmatic level is practically non-existent. Implementing this change will involve quite a bit of re-coding, so I want to make sure I've mapped out the most efficient means possible before starting to write the code. (Which means any other suggestions for increasing the speed of the DB interaction would be welcomed and appreciated!)
Also, I know you can set fields in mySQL to NULL/NOT NULL, but is there a way to set the data type for a field to a boolean (true, false) value?
Thanks in advance for any responses.
cEM
One efficient and flexible n-to-n normalization implementation is a table with two columns, widgetid and categoryid. The primary key is both columns, and if a widget has multiple categories, there's one row for each. You can also look at it backwards, if a category has multiple widgets, there's a row for each widget-to-category pairing.
You can join this table into your query, then filter and sort on the categoryid. It's flexible and fast.
For completeness, you'd probably want a categories table, with categoryid and the name of the category. If you keep the n-to-n table purely integer IDs, you will maximize speed on searching and joins. Be sure to index it backwards and forwards so that you can quickly answer both questions: "What widgets are in this category?" and "What categories is this widget in?".
You can also speed up what you have now by searching the category list in MySQL. The function FIND_IN_SET will pick out one number from a comma-seperated list, and you can use it in your where clause. It doesn't scale well though, because each value needs to be searched by "hand". It can't use indexes to speed up the search.
Assume for a moment, though, that I'm a complete moron (not so far from the truth). Anyway you could go back and explain this...
One efficient and flexible n-to-n normalization implementation is a table with two columns, widgetid and categoryid. The primary key is both columns, and if a widget has multiple categories, there's one row for each. You can also look at it backwards, if a category has multiple widgets, there's a row for each widget-to-category pairing.
...in somewhat simpler terms? Specifically, I don't know what 'n-to-n' means. I'm assuming if I understood that, I would understand what use a table with a widgetid and columnid would be, but at the moment I'm not following.
Sorry to trouble you, but could you take a moment to explain it out? And if possible, an example that shows how a search on such a database would work?
Thanks,
cEM
The difference between "1-to-n" and "n-to-n" is sometimes slim. In this case, you might say that "1 widget has any number of categories", but a different point of view might lead you to say that "a category can contain any number of widgets." Conveniently, the usual database implementation of 1-to-n and n-to-n relationships is exactly the same, so it doesn't matter in practice.
So to implement an n-to-n, you use a two-column table that joins the two things. Some people call this technique a "join table", as it usually serves no function outside bringing the things together. Given tables Widgets and Categories, the joining table might be called Widgets2Categories, WidgetCategoryJoin, Widget_Category, etc.
An example:
CREATE TABLE Widgets ( widgetid INTEGER, name VARCHAR(64) ...)
CREATE TABLE Categories ( categoryid INTEGER, name VARCHAR(64) ...)
then
CREATE TABLE WidgetCategories ( widgetid INTEGER, categoryid INTEGER, PRIMARY KEY (widgetid, categoryid), INDEX (categoryid, widgetid))
Display all widgets in category 2, sorted by name:
SELECT Widgets.widgetid, name FROM WidgetCategories, Widgets WHERE categoryid=2 AND WidgetCategory.widgetid=Widgets.widgetid ORDER BY name
Retrieve categories for widget 23:
SELECT categoryid FROM WidgetsCategories WHERE widgetid=23
Fancy- display all widgets by category, sorted by category name then widget name:
SELECT Widgets.widgetid, Widgets.name FROM WidgetCategories, Widgets, Categories WHERE WidgetCategories.categoryid = Categories.categoryid AND WidgetCategories.widgetid=Widgets.widgetid ORDER BY Categories.name, Widgets.name
Updating a widget's categories (lazy way):
$new_cats = array(2,4,5,6);
mysql_do('DELETE FROM WidgetCategories WHERE widgetid=12');
foreach ($new_cats as $categoryid) {
mysql_do('INSERT INTO WidgetCategories (widgetid, categoryid) VALUES (12, ' . $categoryid . ')');
}
[I use the PEAR DB class instead of the raw mysql functions, so please excuse my made-up php mysql function syntax. :) ]
jollymcfats, that's amazing. Thank you.
If I'm following the basics here: each widget gets an ID number and a name. Each category gets an ID number and a name. In the join table, these ID numbers match the corresponding widgets and categories in those tables. Then to access and sort the names, you used the join table to cross-index. Is this faster because it's searching a table full of integers rather than strings?
I have to confess to a touch more ignorance here and ask what this...
Widgets.widgetid, name
...means, exactly. I'm very close to following the language in your queries but I've never queried multiple tables before and don't know exactly what your syntax says.
Thanks again for taking the time to explain. That last post was a real eye opener.
cEM
In the join table, these ID numbers match the corresponding widgets and categories in those tables. Then to access and sort the names, you used the join table to cross-index. Is this faster because it's searching a table full of integers rather than strings?
You could build the same cross-index using widget name and category name instead, but it would be orders of magnitude slower.
An integer comparison can be performed in a single step (single CPU instruction, even); generally speaking, a string comparison requires such a comparison for each character until a match is found. MySQL can even answer some queries using integer columns (like this join) directly from the index, without having to open and consult the table.
The database is happiest when it can join together relationships with numbers. Usually the things modeled in the database don't naturally have a unique number- say, a "category": a category is just an idea with a name- but most databases have a handy feature called an "identity column" to supply numbers to things that could benefit from having one. In MySQL this is an AUTO_INCREMENT column, a magic column that is automatically filled in with an incrementing number on each insert. The first row gets a 1, the second gets a 2, etc. These numbers don't mean anything to you or I, they are solely for the database to use for quick access to rows.
If you have
CREATE TABLE Categories ( categoryid INTEGER AUTO_INCREMENT, name VARCHAR(64) )and load it up with
INSERT INTO Categories (name) VALUES ('Fast Bubblers'), you'd find that SELECT name FROM Categories WHERE categoryid=1will return 'Fast Bubblers'.
So anyhow, my point here is that numbers are fast, and with AUTO_INCREMENT it's easy to add them to your database. You can even do an
ALTER TABLE Widgets ADD widgetid INTEGER AUTO_INCREMENTto your existing Widgets table and you'll find that the existing rows have been numbered for you, and new widgets will also get a number.
I have to confess to a touch more ignorance here and ask what...Widgets.widgetid, name...means, exactly.
In a "SELECT fieldname FROM" MySQL query, "fieldname" is really shorthand. The full syntax is "SELECT databasename.tablename.fieldname FROM", but MySQL fills in as much of that as possible.
The "databasename" is assumed from which database you connect to. If only one table is present in FROM, that tablename is assumed. If multiple tablenames are present in FROM, the tablename will also be assumed, so long as the fieldname is unique. If it is present in multiple tables, as widgetid is above, you need to qualify the fieldname to tell the database which table's widgetid you want. They're both equal in this example, so it doesn't matter which you pick.
Say I these are my two tables:
Widgets
widget_id---name---region
1-----------blue----north
2-----------red-----south
3-----------green-----east
Category
cat_id--- name
1---------big
2---------little
3---------limited edition
In my old Widgets table, where category was a column, if row 1 was:
1----blue----north----big, limited edition
How do I now indicate the two categories (big and limited edition) into which the 'blue' widget falls? How does it enter into Category and/or the join table WidgetCategory?
One efficient and flexible n-to-n normalization implementation is a table with two columns, widgetid and categoryid. The primary key is both columns, and if a widget has multiple categories, there's one row for each.
In the quote above, you said if a widget falls into more than one category, give each one a row. Does that mean the WidgetCategory table looks like this, with two entries for the 'blue' widget?
WidgetCategory
widget_id-------cat_id
1---------------1 (=big)
1---------------3 (=limited edition)
Again, thanks for taking the time (and having the patience) to help.
cEM
I haven't seen the second edition, but the net folk seem to think it's still very good.