How is ODP structured

Forum Moderators: open

Message Too Old, No Replies

How is ODP structured

byepolar

4:43 am on Mar 17, 2003 (gmt 0)

[dmoz.org...]

I can't seem to figure out how it is structured (e.g. how they come up with those numbers)

Like how do they get the number 360 here;
[dmoz.org...]

There aren't 360 site listings in this cat? And the category #'s don't add up to 360. Where does that # come from?

motsa

5:02 am on Mar 17, 2003 (gmt 0)

If you add up the listings directly in that category and the ones that are in subcategories without an @ after them, you'll find they add up correctly (there's only one "real" subcategory of this one). Categories that end in @ are symlinks to categories that are physically located in another section of the directory but that could function as subcategories of this one.

cornwall

9:33 am on Mar 17, 2003 (gmt 0)

In the case of your example there are 14 sites listed directly in that category, plus a further 346 in the sub category "Wellness Centers" making 360

The others are, as motsa says, indirect links.

In the DMOZ structure an editor for the category you cite could edit all sites in that category and the sub category Wellness Centers, plus any new sub categories that might be created.

However that editor could not edit any sites in the indirect @ links (unless they had separate privileges in any of them)

g1smd

9:29 pm on Mar 17, 2003 (gmt 0)

Additionally: In the minutes after an update to a category, the category above it may not yet have been updated with the new count for the lower category, so it is possible to see a subcategory listed with a certain number of sites supposedly in it, but when you enter that sub-category the number is different, this latter number being the correct one when you manually count the sites listed.

byepolar

11:47 am on Mar 18, 2003 (gmt 0)

There are more than 347 in this category though:
[dmoz.org...]

windharp

12:20 pm on Mar 18, 2003 (gmt 0)

You did notice that United Kingdom@ (135) is just a symlink to another part and of course does not count to the number of entries in that category as motsa said?

Categories that end in @ are symlinks to categories that are physically located in another section of the directory but that could function as subcategories of this one.

I get exactly 347 with my calculator ;)

hutcheson

2:02 pm on Mar 18, 2003 (gmt 0)

You might ask, "why aren't the totals from the @links included?" (Because most of the time, that would give a more valid indication of the real number of listings related to that category.)

The problem is that there are some categories where widgets (or widgets/handcrafted) includes gadgets as a @link, and gadgets includes @widgets. Obviously, trying to total that up would be futile. Now, most editors have been brought to agree that we shouldn't create @loops like that, but in a few cases we _haven't_ been able to agree which should be the parent (related) category, and which should be the @link. Also, it would take a (rather complex) automated process to find all the @link-@loops, and more time to figure out the best way to break them.

People who use part of the RDF need to be aware of issues like this, because many small- or medium-size categories are spread across two or three top-level categories, with @links and related-category links to tie everything together. An extract that doesn't also include @linked categories is going to be painfully deficient.

Rhadamanthus

4:36 pm on Mar 18, 2003 (gmt 0)

Also, it would take a (rather complex) automated process to find all the @link-@loops, and more time to figure out the best way to break them.
===========================================================

I have to disagree. Speaking as a programmer, finding all the @ loops isn't really very hard, nor is it overly time intensive. You can do it in one pass through the tree if you design your data structures correctly.

rfgdxm1

4:42 pm on Mar 18, 2003 (gmt 0)

True Rhadamanthus. All it would take is one pass through the tree, checking every @link and following where it went to see if there was a loop.

Rhadamanthus

5:15 pm on Mar 18, 2003 (gmt 0)

Exactly.

g1smd

2:19 am on Mar 19, 2003 (gmt 0)

There already is an editor tool to find incestuous symlinks, and to do other analysis of such links. I use the "who links to me" tool quite a lot as well.

windharp

9:00 am on Mar 19, 2003 (gmt 0)

"if you would design your data structures properly"

Hey, guys - we are talking about a project that was never meant to grow as large as it is nowadays. Forget usefull structures. It is kept as simple as possible, based on the original structure invented by the founders: directory based structures, data in files in those directories, and so on. No database, no queries, nothing.

And changing to another strucutre? Nearly impossible - better rebuild from scratch. If anyone here ever worked on a true university project tat still is carried on from the old times he should know what I am speaking about - lots of standalone tools [most written in perl, so there are at least no very huge language problems] which you would have to rebuild.

For myself I think counting @links is stupid, because the real number of links is hidden. You would show virtual entries by duplicating numbers, not the real amount of links in your directory.

And for the ease of summing up the links, I can give you a hint where the real difficulty is. Imagine the following very easy example (which is quite common and may be much more difficult in reality. Nobody is perfect and so are the @links people generate)

Main Cat
-- Sub1
-- -- Sub1_1
-- -- -- @Anywhere1/Sub1
-- -- @Anywhere1
-- Sub2

Ok, now tell me how the hell a program should avoid double counting the links in Anywhere1/Sub1 when calculating the amount of sites in Sub1? If you are thinking about it for a while you will be able to generate much more difficult examples. The only way to solve it would lead to a NxN evaluation of all categories. And calculating numbers is done LIVE (which means when regenrating the static HTML-pages that represent the public view of the page), so that is nothing that could be realized.

cornwall

10:09 am on Mar 19, 2003 (gmt 0)

>>For myself I think counting @links is stupid, because the real number of links is hidden

Not so. In many instances it is actually very misleading to the user if @links are not included.

For example, if you look at hotels
Top: Recreation: Travel: Lodging: Hotels and Motels (37)
Shows there to be 37 hotels in the entire directory. There are in fact 10's of thousands behind @ links to their regional homes.

As the user drills down, even to state level in USA, there is no indication as to how much accommodation is actually included, because the numbers are hidden behind @ links. The counting problem makes the data valueless to the user, as they are likely to avoid drilling down through categories that appear "empty"

choster

2:53 pm on Mar 19, 2003 (gmt 0)

The Google directory counts contents of categories one level deep. The Netscape and AT&T directories treats subcategories and @links as listings, like Yahoo. AskJeeves does not provide site counts. So long as the results are reasonably consistent across categories, IMHO it is somewhat academic to speak of a correct or incorrect counting mechanism-- well maybe the AOL search directory, now that is out of whack. After all, few everyday users browse dmoz.org.

Rhadamanthus

4:15 pm on Mar 19, 2003 (gmt 0)

Main Cat
-- Sub1
-- -- Sub1_1
-- -- -- @Anywhere1/Sub1
-- -- @Anywhere1
-- Sub2

Ok, now tell me how the hell a program should avoid double counting the links in Anywhere1/Sub1 when calculating the amount of sites in Sub1?
===========================================================

windharp, it's not as difficult as you're making it out. What you do, is you add a flag to each node in your tree that contains a flag that says "is this node visited already or not". You initialize it to false for your entire tree, and then begin your traversal. For each node that you visit, you sum up the count for it, store it with that node, and then set your "visited flag" to true. If you get to a node that's already been visited, then you already have a count for it, so you don't need to continue.

Using your example tree:

First, you'd hit main category, and you'd start running through it's list of child nodes. You hit "Sub1" first. Inside "Sub1", you come upon "Sub1_1", and then "@Anywhere1/Sub1". At each stop along the way, you mark the node as visited. After you finish "Anywhere1/Sub1", your standard recursive traversal function goes back up the tree to "Sub1_1", which is also finished, so it goes back up to "Sub1", and sees that the next node in the list is "@Anywhere1". It traverses down to "@Anywhere1" and counts it up. When it gets to "@Anywhere1/Sub1", it sees that it's already been visited, so rather than recount it, it simply adds the count that you already got and stored, and then continues on to the next node in the tree.

This is an order-n operation, where n is the number of *LINKS* on your tree (not the number of nodes - important distinction). It's really pretty simple. If you actually have *circular* links (which your example doesn't show), it's a little harder, but not much. This example is actually circular:

Category
-- Sub1
-- -- Sub1_1
-- -- -- @Sub1

This is pretty easy to solve as well, though it is a bit more intensive. If you store the complete path at each level of your tree traversal, all you have to do is check your path and see if your current node is already in it. If it is, simply don't add the count this time. This does turn into an (N * N) calculation, as you say, but it can be *HEAVILY* optimized to cut it down. For instance, if you think about it, you'll realize that I only need to actually check my path when I hit a *link*, not on every node. This immediately drops it to an N(log N) problem.

Also, if I store the complete *real* path with each node, instead of just at each level of the traversal, then I can shorten my comparison even further. First, you check the level in your *search* path that corresponds with the level in your *real* path for that node. So, when I'm processing "@Sub1", then I check level 2, since it's real path is "Category/Sub1". In my above example, this finds it, so I'm done. If that didn't find it, I only have to compare it against the other links, because if it were a "real" node in this search path, my previous check would have found it. So now, we've dropped our calculation to order ((Log N) * (Log N)), which is not a bad calculation at all - and significantly better than (N * N)

After the algorithmic level, you can do some optimization on your data to speed it up, too. For instance, giving each node a unique numerical ID instead of comparing string based paths will give you a huge performance boost. Depending upon the exact implementation, there may be other tricks you can play.

But more importantly than any of what I've said above, there is absolutely no reason to generate this information on the fly. Your argument that it's generated on the fly represents an arbitrary requirement based on the way it's currently done. There's really no good reason to do it that way, since the count would only change if the tree changes, which is going to be a lot less often than it's viewed.

Now that I've written this whole long thing, I think I've hit upon an even better way to do it. If you only calculated the data at insertion time (whenever you add a new node) and then saved the data as part of the tree, I'm pretty sure you could do it in (log N) time.

My argument here isn't that any of this is the way it's currently done - it's just to point out that there's a relatively straightforward way that it *can* be done in a reasonably efficient manner, and the only reason I even bring it up is because several people have already made strong claims that it's very difficult, very slow, or both. As I think I've shown, it doesn't have to be.

hutcheson

6:21 pm on Mar 19, 2003 (gmt 0)

>My argument here isn't that any of this is the way it's currently done - it's just to point out that there's a relatively straightforward way that it *can* be done in a reasonably efficient manner, and the only reason I even bring it up is because several people have already made strong claims that it's very difficult, very slow, or both. As I think I've shown, it doesn't have to be.

I have an unfair advantage here. A couple of jobs ago, I was actually responsible for a program implementing exactly this algorithm -- that is, in a very large digraph, calculate the value for each node based on the value for all nodes pointing to it, in the presence of cycles. (Our graphs were an order of magnitude or so larger than the ODP category set).

I know _exactly_ how hard it is, and I know _exactly_ why the process you propose won't work. We used an algorithm from CALGO, which is in fact linear on the number of nodes+links, but does involve intricate backtracking. You can probably find it in Knuth, or even a graduate-level graph theory textbook.

But that sort of misses what (again, from experience) I meant as the major point. Getting the algorithm working was mildly tricky (i.e. 95% of the programmers there probably couldn't have done it, but most people with a theoretical CompSci background probably could have). But explaining the results to the people responsible for "breaking" the cycles, and showing how the cycles could be broken in the face of belief or evidence that the reality modelled by the graph actually contained cycles, was an ongoing and non-trivial exercise. Every release the algorithm had to be re-run; every release there were newly introduced cycles that had to be analyzed and broken.

g1smd

9:27 pm on Mar 19, 2003 (gmt 0)

Since the number of sites deeper in the tree are added together and displayed as a total of all sites at this level and below, if you also included all the sites in the @links, by the time the totals propogated to the top level of each major branch you would end up with the total of all branches actually being at least twice as big as the actual number of sites in the directory. The only way to not count sites twice is to only count sites that really are physically in that branch; and that is how it is done at the moment.

cornwall

9:59 pm on Mar 19, 2003 (gmt 0)

hutcheson

Thank you for that lucid explanation, one can almost hear the sweat dripping from your brow as you toiled on the problem.

As one who has always thought that this problem of registering the number of sites, including @ links, could be relatively easily solved. I would now accept that if not insoluble, it is extremely difficult/time consuming to solve, and probably would not repay the effort.