Forum Moderators: open
The others are, as motsa says, indirect links.
In the DMOZ structure an editor for the category you cite could edit all sites in that category and the sub category Wellness Centers, plus any new sub categories that might be created.
However that editor could not edit any sites in the indirect @ links (unless they had separate privileges in any of them)
Categories that end in @ are symlinks to categories that are physically located in another section of the directory but that could function as subcategories of this one.
I get exactly 347 with my calculator ;)
The problem is that there are some categories where widgets (or widgets/handcrafted) includes gadgets as a @link, and gadgets includes @widgets. Obviously, trying to total that up would be futile. Now, most editors have been brought to agree that we shouldn't create @loops like that, but in a few cases we _haven't_ been able to agree which should be the parent (related) category, and which should be the @link. Also, it would take a (rather complex) automated process to find all the @link-@loops, and more time to figure out the best way to break them.
People who use part of the RDF need to be aware of issues like this, because many small- or medium-size categories are spread across two or three top-level categories, with @links and related-category links to tie everything together. An extract that doesn't also include @linked categories is going to be painfully deficient.
I have to disagree. Speaking as a programmer, finding all the @ loops isn't really very hard, nor is it overly time intensive. You can do it in one pass through the tree if you design your data structures correctly.
Hey, guys - we are talking about a project that was never meant to grow as large as it is nowadays. Forget usefull structures. It is kept as simple as possible, based on the original structure invented by the founders: directory based structures, data in files in those directories, and so on. No database, no queries, nothing.
And changing to another strucutre? Nearly impossible - better rebuild from scratch. If anyone here ever worked on a true university project tat still is carried on from the old times he should know what I am speaking about - lots of standalone tools [most written in perl, so there are at least no very huge language problems] which you would have to rebuild.
For myself I think counting @links is stupid, because the real number of links is hidden. You would show virtual entries by duplicating numbers, not the real amount of links in your directory.
And for the ease of summing up the links, I can give you a hint where the real difficulty is. Imagine the following very easy example (which is quite common and may be much more difficult in reality. Nobody is perfect and so are the @links people generate)
Main Cat
-- Sub1
-- -- Sub1_1
-- -- -- @Anywhere1/Sub1
-- -- @Anywhere1
-- Sub2
Ok, now tell me how the hell a program should avoid double counting the links in Anywhere1/Sub1 when calculating the amount of sites in Sub1? If you are thinking about it for a while you will be able to generate much more difficult examples. The only way to solve it would lead to a NxN evaluation of all categories. And calculating numbers is done LIVE (which means when regenrating the static HTML-pages that represent the public view of the page), so that is nothing that could be realized.
Not so. In many instances it is actually very misleading to the user if @links are not included.
For example, if you look at hotels
Top: Recreation: Travel: Lodging: Hotels and Motels (37)
Shows there to be 37 hotels in the entire directory. There are in fact 10's of thousands behind @ links to their regional homes.
As the user drills down, even to state level in USA, there is no indication as to how much accommodation is actually included, because the numbers are hidden behind @ links. The counting problem makes the data valueless to the user, as they are likely to avoid drilling down through categories that appear "empty"
Ok, now tell me how the hell a program should avoid double counting the links in Anywhere1/Sub1 when calculating the amount of sites in Sub1?
===========================================================
windharp, it's not as difficult as you're making it out. What you do, is you add a flag to each node in your tree that contains a flag that says "is this node visited already or not". You initialize it to false for your entire tree, and then begin your traversal. For each node that you visit, you sum up the count for it, store it with that node, and then set your "visited flag" to true. If you get to a node that's already been visited, then you already have a count for it, so you don't need to continue.
Using your example tree:
First, you'd hit main category, and you'd start running through it's list of child nodes. You hit "Sub1" first. Inside "Sub1", you come upon "Sub1_1", and then "@Anywhere1/Sub1". At each stop along the way, you mark the node as visited. After you finish "Anywhere1/Sub1", your standard recursive traversal function goes back up the tree to "Sub1_1", which is also finished, so it goes back up to "Sub1", and sees that the next node in the list is "@Anywhere1". It traverses down to "@Anywhere1" and counts it up. When it gets to "@Anywhere1/Sub1", it sees that it's already been visited, so rather than recount it, it simply adds the count that you already got and stored, and then continues on to the next node in the tree.
This is an order-n operation, where n is the number of *LINKS* on your tree (not the number of nodes - important distinction). It's really pretty simple. If you actually have *circular* links (which your example doesn't show), it's a little harder, but not much. This example is actually circular:
Category
-- Sub1
-- -- Sub1_1
-- -- -- @Sub1
This is pretty easy to solve as well, though it is a bit more intensive. If you store the complete path at each level of your tree traversal, all you have to do is check your path and see if your current node is already in it. If it is, simply don't add the count this time. This does turn into an (N * N) calculation, as you say, but it can be *HEAVILY* optimized to cut it down. For instance, if you think about it, you'll realize that I only need to actually check my path when I hit a *link*, not on every node. This immediately drops it to an N(log N) problem.
Also, if I store the complete *real* path with each node, instead of just at each level of the traversal, then I can shorten my comparison even further. First, you check the level in your *search* path that corresponds with the level in your *real* path for that node. So, when I'm processing "@Sub1", then I check level 2, since it's real path is "Category/Sub1". In my above example, this finds it, so I'm done. If that didn't find it, I only have to compare it against the other links, because if it were a "real" node in this search path, my previous check would have found it. So now, we've dropped our calculation to order ((Log N) * (Log N)), which is not a bad calculation at all - and significantly better than (N * N)
After the algorithmic level, you can do some optimization on your data to speed it up, too. For instance, giving each node a unique numerical ID instead of comparing string based paths will give you a huge performance boost. Depending upon the exact implementation, there may be other tricks you can play.
But more importantly than any of what I've said above, there is absolutely no reason to generate this information on the fly. Your argument that it's generated on the fly represents an arbitrary requirement based on the way it's currently done. There's really no good reason to do it that way, since the count would only change if the tree changes, which is going to be a lot less often than it's viewed.
Now that I've written this whole long thing, I think I've hit upon an even better way to do it. If you only calculated the data at insertion time (whenever you add a new node) and then saved the data as part of the tree, I'm pretty sure you could do it in (log N) time.
My argument here isn't that any of this is the way it's currently done - it's just to point out that there's a relatively straightforward way that it *can* be done in a reasonably efficient manner, and the only reason I even bring it up is because several people have already made strong claims that it's very difficult, very slow, or both. As I think I've shown, it doesn't have to be.
I have an unfair advantage here. A couple of jobs ago, I was actually responsible for a program implementing exactly this algorithm -- that is, in a very large digraph, calculate the value for each node based on the value for all nodes pointing to it, in the presence of cycles. (Our graphs were an order of magnitude or so larger than the ODP category set).
I know _exactly_ how hard it is, and I know _exactly_ why the process you propose won't work. We used an algorithm from CALGO, which is in fact linear on the number of nodes+links, but does involve intricate backtracking. You can probably find it in Knuth, or even a graduate-level graph theory textbook.
But that sort of misses what (again, from experience) I meant as the major point. Getting the algorithm working was mildly tricky (i.e. 95% of the programmers there probably couldn't have done it, but most people with a theoretical CompSci background probably could have). But explaining the results to the people responsible for "breaking" the cycles, and showing how the cycles could be broken in the face of belief or evidence that the reality modelled by the graph actually contained cycles, was an ongoing and non-trivial exercise. Every release the algorithm had to be re-run; every release there were newly introduced cycles that had to be analyzed and broken.
Thank you for that lucid explanation, one can almost hear the sweat dripping from your brow as you toiled on the problem.
As one who has always thought that this problem of registering the number of sites, including @ links, could be relatively easily solved. I would now accept that if not insoluble, it is extremely difficult/time consuming to solve, and probably would not repay the effort.