How many sites does Dmoz.org REALLY have?
We can only find 3,347,837, not 3.8 million
We have been using the complete ODP dump for a few weeks, and we just can't find more than 3,347,837 sites in there, including duplicates and dead/non-existent sites, and excluding Netscape cats.
The Dmoz.org front page claims that they have 3.8 million sites. Does anybody know what could be wrong here?
Are you sure the standard RDF dump includes all toplevel categories?
Eg. "Kids & Teens" or the (normally hidden) "Adult"?
The 3.8 is relatively new, and may not have been released to the public RDF dump yet - though I certainly could not qualify or quantify this.
You'd need to run that check on the same date as the RDF dump to know what the gap is; also, you could add "Test" and "Bookmarks" to the possibles mentioned above.
[edited by: Quadrille at 5:29 pm (utc) on Oct. 16, 2002]
>Are you sure the standard RDF dump includes all toplevel
>Eg. "Kids & Teens" or the (normally hidden) "Adult"?
Yes, apparently it does; Adult is also there. Only Netscape and Brazil (don't ask me why) are separate.
>The 3.8 is relatively new, and may not have been released
>to the public RDF dump yet
Well, the RDF is supposed to be updated on a monthly basis. Maybe they are late or something, but it is a big difference: 450,000 sites are nowhere to be found.
These are the main cats:
What are Test and Bookmarks, by the way?
Bookmarks are editor bookmarks: categories and sites held by individual editors. These may be sites being monitored, sites held temporarily (under construction, under investigation, etc.), alternative arrangements of sites (demos for re-organizations, etc.), or editors' favorites. Many are duplicates of listed sites, but there are exceptions.
Test can include all the above (but not visible to public), but also 'official' projects being undertaken by editors (singly or in groups), plus a 'holding area' for sites being moved for various reasons. Many editor tools use Test categories, and many internal discussions use them as a resource. And More!
I hope I'm not saying too much here!
The 3.8 count is just an estimate, put up a couple of months ago when the automatic count went wonky. Since then there has been a Robo run (detection and removal of dead sites), some big category restructuring (which often involves removing duplicate listings) and some categories removed entirely.
So don't worry about it. Maybe when the current server problems are fixed, the automatic count will work again. Then the editors can resume betting on the 4 million mark again. ;)
Ok, thank you all.
Neither Bookmarks nor Test is included in the RDF dump anymore. Bookmarks are accessible from the DMOZ.org main page, while Test is no longer visible to non-editors.
There are archived versions out there on the web, so you can go see for yourselves.
Many of the /World categories (i.e. the non-English ones) are gradually being converted from various character set encodings such as ISO-8859-7, ISO-2022, Shift-JIS, BIG-5, and others to a uniform UTF-8 encoding. There are a LOT of software tweaks, on submission forms, editing screens, public view, and so on, to take into account. As there is a big possibility of breaking something in transit, each category is duplicated in a hidden category for a few weeks while editors try all the features. When all is found to be in order, the public category is replaced with a newly regenerated and converted version, and the hidden category is deleted. At one point there were several dozen languages going through testing, all in parallel. So, several months ago, the directory numbers appeared to receive a boost. Now that the work has been done in many categories, the removal of these hidden duplicates sees the directory number shrink back to where it was, except for the addition of about 4000 completely new sites into the whole directory each week in the meantime.
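For anyone wondering what such a conversion involves at the byte level, here is a minimal Python sketch. It is not the ODP's actual tooling, which is not described in this thread. The helper `to_utf8` and the sample strings are made up for illustration; the codec names are the ones Python's standard library uses for three of the encodings mentioned above.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy encoding and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical sample titles, one per legacy encoding named in the post.
SAMPLES = {
    "iso-8859-7": "Ελλάδα",   # Greek
    "shift_jis": "日本語",     # Japanese
    "big5": "中文",            # Traditional Chinese
}

converted = {}
for encoding, text in SAMPLES.items():
    legacy_bytes = text.encode(encoding)      # what the old category would store
    converted[encoding] = to_utf8(legacy_bytes, encoding)

# Decoding with the wrong charset does not fail loudly, it just yields garbage,
# which is one reason a trial period with a duplicated category makes sense.
greek_bytes = SAMPLES["iso-8859-7"].encode("iso-8859-7")
misread = greek_bytes.decode("latin-1")
```

The point of the last two lines is that a wrong source encoding often goes unnoticed until someone actually looks at the page, so a round-trip check against known text is worth having.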
Thanks g1smd, but I'm not sure I'm following you. It was a counting error, then? While the public pages keep on showing the appropriate charset, and may confuse any spider, everything in the dumps has been correctly formatted as UTF-8 since 1999.
To count the links correctly, you just need to read them from the content.rdf file, and maybe add the few thousand per week added since the last RDF update. But a 450,000-site error seems just a little out of order.
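A count like the one described above can be sketched in a few lines of Python. This is only a rough sketch, and it rests on an assumption about the dump layout that you should check against your own copy: that each listing appears as an `<ExternalPage about="...">` element followed by a `<topic>Top/...</topic>` line. The function name `count_listings` and the sample data are made up for illustration. It scans line by line with regular expressions instead of using an XML parser, so it does not depend on the dump being strictly well-formed.

```python
import re
from collections import Counter

# Assumed line shapes in the dump; verify against the real file.
PAGE = re.compile(r'<ExternalPage\s+about="([^"]*)"')
TOPIC = re.compile(r'<topic>Top/([^/<]+)')


def count_listings(lines):
    """Return (total listings, distinct URLs, listings per top-level category)."""
    total = 0
    urls = set()
    per_category = Counter()
    for line in lines:
        match = PAGE.search(line)
        if match:
            total += 1
            urls.add(match.group(1))
            continue
        match = TOPIC.search(line)
        if match:
            per_category[match.group(1)] += 1
    return total, len(urls), per_category


# Tiny hypothetical sample: three listings, one URL listed twice.
SAMPLE = [
    '<ExternalPage about="http://example.com/">',
    '  <d:Title>Example</d:Title>',
    '  <topic>Top/Arts/Music</topic>',
    '</ExternalPage>',
    '<ExternalPage about="http://example.com/">',
    '  <topic>Top/Arts/Movies</topic>',
    '</ExternalPage>',
    '<ExternalPage about="http://example.org/kids">',
    '  <topic>Top/Kids_and_Teens/Games</topic>',
    '</ExternalPage>',
]

total, distinct, per_category = count_listings(SAMPLE)

# On a real dump you would pass the open file instead of SAMPLE, e.g.:
# with open("content.rdf", encoding="utf-8", errors="replace") as f:
#     total, distinct, per_category = count_listings(f)
```

Comparing `total` with `distinct` shows how many listings are duplicates, and `per_category` makes it easy to see which top-level categories are present in the dump at all.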
Someone pointed out that ODP has just "cleaned" the non-existent sites, but the truth is most of them are still there, many as redirects to link farms.
Anyhow, 3.3 million manually edited pages is more than enough, and the work done by ODP editors and DMOZ staff is no doubt very impressive.