Forum Moderators: open
The scenario: users are able to submit a link, and tag it with keywords. The script needs to be able to analyze the keywords (i.e. how many times does a particular keyword appear altogether, how many times does it appear in association with a particular link, etc).
Scheme 1: If I use 1 table with fields: LINK ID, URL, TAGS (space delimited list of tags), it'll take a lot of processing to analyze the words.
Scheme 2: If I use 2 tables, first with fields: LINK ID, URL, and second with fields: TAG ID, LINK ID, TAG - that's a massive increase in db management overhead, and seems like it'd quickly get out of hand with even a relatively small number of tags.
I'm a pretty amatuer php/mysql'er, so I don't really have an intuitive feel for which scheme is more efficient. Any advice would be appreciated. Particularily if it has something to do with a scheme 3. :)
Option 1 is better, but as you say, storing the tags as space delimited strings is hopelessly inefficient query-wise and for statistical analyses.
Why not limit the number of tags allowed and store each tag in its own field in the link table? This might be a suitable compromise.
At some point, you are probably going to have to consider some sort of off-line statistical gathering method if the amount of data becomes too large.
[edited by: FalseDawn at 3:16 pm (utc) on Nov. 17, 2006]
Why not limit the number of tags allowed and store each tag in its own field in the link table? This might be a suitable compromise.
Ahh, hadn't thought of that. It the end, there'd probably be a huge number of null fields though, as most users would probably use only a couple of tags, while the limit would be 8 or so... something I want to avoid.
At some point, you are probably going to have to consider some sort of off-line statistical gathering method if the amount of data becomes too large.
That point made me realize: some latency in tag analysis is ok. It would only have to update every couple hours or so, and it WOULD be easy to do that somewhere besides on the server. Here I was thinking the data would have to be evaluated with every bloody page load...
I think that'll work ok for now.
I wonder how it'd scale up - if the site ever really took off, would I be trying to download a 2GB file for analysis? Well, if it ever got to that point, I imagine I could hire a REAL geek to deal with it. :p
I'm still interested though, what other kinds of schemes are there for dealing with this type of situation? Completely outside the RDBMS box...
Thanks a lot for your post.
Do NOT store tag IDs in n "slots" in the link records! This is one of the worst database structure sins you can commit. Do that, and now you have to iterate over the tags within a record and you have lost most of the power of the relational model.
The relational model was invented to solve this specific problem!
Yes, it could get big fast. You just have to throw yourself at the mercy of caching and optimization.
I'd question the "expert's" opinion mentioned in the original post. Tags map perfectly to RDBMs. They just might take more processing time than you would like, but that is another issue.
#2 from the initial post is the right solution.
....
The relational model was invented to solve this specific problem!
....
Yes, it could get big fast. You just have to throw yourself at the mercy of caching and optimization.
5,000 members could easily generate 500,000 - 1 million entries in the database. Again, I'm pretty new to this, so I don't really have an idea at what point databases become... overloaded (do they ever? is there a software limit to the number of entries in a database?). 500,000 just seems like a very big number to me!
I'd question the "expert's" opinion mentioned in the original post. Tags map perfectly to RDBMs. They just might take more processing time than you would like, but that is another issue.
Well, I dunno if he's an expert, but yahoo DID just buy his site for 30 million...! Not that that says much about his "expertise" I suppose!
Yeah, theoretically, the tags do map well. I guess it just "feels" innefficient in practice.
500,000 or 1,000,000 records is not particularly large. RDBMS indexing is quite efficient, and the number of disk accesses need to locate one record in a database of this size, or 10x this size is minimal. Index caching means that most of the time, there probably will be close to zero disk accesses to locate a record. (And then one to read it.)
Sure, one can always design a better/faster/more efficient sceme to store and retrieve some particular type of data, given details of the nature of the data and relations, typical access patterns, etc. What did the guy in the interview say that they used instead of an RDBMS?
Tags don't map well to SQL. Sometimes you can prune based on usage - only
index the first few pages for example. This keeps indexes small and fast.
I'd estimate that with his site's membership, the tag DB has got to have near 50 million records...
select * from table1 where (tag1 = 'widget' or tag2 = 'widget2' or tag3 = 'widget')
I would modify option 2 from the OP:
TABLE 1:
--------
urlID
url
TABLE 2: (a distinct list of all possible tags)
-------
tagID
tag
TABLE 3:
------
tableID
urlID
tagID
#2 from the original post unnecessarily stores the tag redundantly in each tag-url relation.
You may also want to introduce users or even user groups to the schema. Of course, this multiplies the number of records further. But there are many useful application of keeping track of which user tagged what.
I am particularly intrigued by the idea of being able to put together abritrary "panels" of users (or, call them user groups) and then view a tag cloud that lets you look at things through the eyes of that particular group. Dunno if anyone is doing this, but I have to presume I'm not the first to think of it...
One further refinement would be to allow users to specify their capacity when they tag - either as an individual, or as a member of some panel or panels they belong to. The different is subtle - panels may have specific tagging guidelines, so it might not appropriate to attribute all tags by an individual to the panel simply because the user belongs to it.
I tend to prefer "short, fat" tables to "tall, thin" ones :-)
Bad habit I know, but I'm an old timer...
Hah! Me too. Completely irrational I suppose.
stajer:
I'm not quite clear on the need for a third table. What does it accomplish? The only thing I can think of is that it's faster to search through integers than it is through varchars...
jtara mentioned that it removes tag redundance, but it seems like it just changes the issue from storing the data redundantly to storing a pointer to the data redundantly - what is the advantage?
jtara:
I am particularly intrigued by the idea of being able to put together abritrary "panels" of users (or, call them user groups) and then view a tag cloud that lets you look at things through the eyes of that particular group....
There are all kinds of neat possibilities and potential for emergent intelligence (is that a buzzword yet?). I left it out of the OP, but it IS going to have a user ID (in table 1 of stajer's schema). I hadn't thought about groups.
One further refinement would be to allow users to specify their capacity when they tag - either as an individual, or as a member of some panel or panels they belong to....
I'm not sure I quite understand. Do you mean "capacity" as in "life role" or "areas of interest"? So if someone from the "programmers" group bookmarked a site on basket weaving, it would have less weight in the programmer group's "cloud"?
I have noticed on the site mentioned ealier - most of the data that makes it's way to the top is related to programming/web design; I wonder if the site would grow even faster if that niche was attenuated to let more general interest items to the top more often.
I'm not sure I quite understand. Do you mean "capacity" as in "life role" or "areas of interest"?
I am thinking here in terms of fairly formal group membership.
Let's say Karen is a knitter, and has declared herself as such by checking off "knitting" as one of her interests. She's also been invited to be a member of the "Elite Knitter's Inner Circle Advisory Board".
Now, when Karen tags something, is she doing so in her capacity a knitter, or as a member of the Elite Knitter's Inner Circle Advisory Board? And does she even want her tags regarding the special Thursday Night performance of the Chippendales at the Oak Lawn Bowl to be recorded in either of those capacities?
I think there are multiple overlay views in the same data possible. I'm thinking of this, for example, for a site I am developing that appeals both and in different ways to people who live in a town, and to out-of-towners. You might want to view things alternately as seen by a local, and out-of-towners, those in a particular age group, male vs. female, etc. etc. etc.
Although not of interest to me for the site I'm developing, it also struck me that there are cases where you might want to have things tagged, rated, or both by formal (but ad-hoc) groups. i.e. one might have to be invited to be a member of the group. So, when you vote, rate, or tag, you may or may not want it to "count" for a given group. (And you might have membership in multiple groups.)
There is some precedent to this in website filtering schemes, where in some cases, it is possible to choose from among multiple rating standards.