

Directories Forum

    
How many sites does Dmoz.org REALLY have?
We can only find 3,347,837, not 3.8 million
Marcos




msg:486637
 3:35 pm on Oct 16, 2002 (gmt 0)

We have been using the complete ODP dump for a few weeks, and we just can't find more than 3,347,837 sites in there, including duplicates and dead or non-existent sites, and excluding the Netscape cats.
The Dmoz.org front page claims that they have 3.8 million sites. Does anybody know what could be wrong here?
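
A count like this can be reproduced by streaming the dump and counting listing entries. The sketch below is only an illustration: it assumes each listed site opens with an `<ExternalPage about="...">` element, which is how the ODP content dump is usually described, and the function name and sample lines are made up here.

```python
import re
from typing import Iterable, Tuple

# Assumption: every listed site opens with <ExternalPage about="URL">.
EXTERNAL_PAGE = re.compile(r'<ExternalPage\s+about="([^"]*)"')


def count_listings(lines: Iterable[str]) -> Tuple[int, int]:
    """Return (total listings, distinct URLs) found in the given lines."""
    total = 0
    distinct = set()
    for line in lines:
        match = EXTERNAL_PAGE.search(line)
        if match:
            total += 1
            distinct.add(match.group(1))
    return total, len(distinct)


# Hypothetical sample lines; the same URL is listed twice.
sample = [
    '<ExternalPage about="http://example.com/">',
    '<ExternalPage about="http://example.org/">',
    '<ExternalPage about="http://example.com/">',
]
total, unique = count_listings(sample)
print(total, unique)  # 3 listings, 2 distinct URLs
```

For a real dump, an open file object can be passed in place of the list, so the file is read line by line rather than loaded whole. The difference between the two numbers gives a rough idea of how many duplicates are included in the total.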

 

bird




msg:486638
 3:41 pm on Oct 16, 2002 (gmt 0)

Are you sure the standard RDF dump includes all toplevel categories?
Eg. "Kids & Teens" or the (normally hidden) "Adult"?

caine




msg:486639
 3:48 pm on Oct 16, 2002 (gmt 0)

The 3.8 is relatively new, and may not have been released to the public RDF dump yet, though I certainly could not qualify or quantify this.

Quadrille




msg:486640
 5:18 pm on Oct 16, 2002 (gmt 0)

You'd need to run that check on the same date as the RDF dump to know what the gap is; also, you could add "Test" and "Bookmarks" to the possibilities mentioned above.

[edit]and World?[/edit]

[edited by: Quadrille at 5:29 pm (utc) on Oct. 16, 2002]

Marcos




msg:486641
 5:24 pm on Oct 16, 2002 (gmt 0)

>Are you sure the standard RDF dump includes all toplevel
>categories?
>Eg. "Kids & Teens" or the (normally hidden) "Adult"?

Yes, apparently it does; Adult is also there. Only Netscape and Brazil (don't ask me why) are separate.

>The 3.8 is relatively new, and may not have been released
>to the public RDF dump yet

Well, the RDF is supposed to be updated on a monthly basis. Maybe they are late or something, but it is a big difference: 450,000 sites are nowhere to be found.

These are the main cats:

Adult
Arts
Business
Computers
Games
Health
Home
News
Recreation
Reference
Regional
Science
Shopping
Society
Sports
Test
World
Private
Kids_and_Teens
Bookmarks
Netscape

What are Test and Bookmarks, by the way?
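
To see which top-level categories contribute to the total, the listings can be tallied by the first path segment of their category. This is a rough sketch, assuming each listing carries a `<topic>Top/...</topic>` line naming its category; the function name, the `exclude` parameter, and the sample lines are hypothetical.

```python
import re
from collections import Counter
from typing import Iterable

# Assumption: each listing has a line like <topic>Top/Arts/Music</topic>.
TOPIC = re.compile(r"<topic>Top/([^/<]+)")


def count_by_top_level(lines: Iterable[str], exclude=frozenset()) -> Counter:
    """Tally listings per top-level category, skipping excluded ones."""
    counts = Counter()
    for line in lines:
        match = TOPIC.search(line)
        if match and match.group(1) not in exclude:
            counts[match.group(1)] += 1
    return counts


# Hypothetical sample lines standing in for the dump.
sample = [
    "<topic>Top/Arts/Music</topic>",
    "<topic>Top/Adult/Example</topic>",
    "<topic>Top/World/Deutsch</topic>",
    "<topic>Top/Arts/Movies</topic>",
]
counts = count_by_top_level(sample, exclude={"Adult"})
for name, number in sorted(counts.items()):
    print(name, number)
```

Running the tally with and without categories such as Adult, Test, or Bookmarks in `exclude` shows how much each one moves the overall figure.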

Quadrille




msg:486642
 5:37 pm on Oct 16, 2002 (gmt 0)

Bookmarks are editor bookmarks: categories and sites held by individual editors. These may be sites being monitored, sites held temporarily (under construction, under investigation, etc.), alternative arrangements of sites (demos for reorganizations, etc.), or editors' favorites. Many are duplicates of listed sites, but there are exceptions.

Test can include all the above (but not visible to public), but also 'official' projects being undertaken by editors (singly or in groups), plus a 'holding area' for sites being moved for various reasons. Many editor tools use Test categories, and many internal discussions use them as a resource. And More!

I hope I'm not saying too much here!

vmcknight




msg:486643
 5:45 pm on Oct 16, 2002 (gmt 0)

The 3.8 count is just an estimate, put up a couple of months ago when the automatic count went wonky. Since then there has been a Robo run (detection and removal of dead sites), some big category restructuring (which often involves removing duplicate listings) and some categories removed entirely.

So don't worry about it. Maybe when the current server problems are fixed, the automatic count will work again. Then the editors can resume betting on the 4 million mark again. ;)

Marcos




msg:486644
 5:49 pm on Oct 16, 2002 (gmt 0)

Ok, thank you all.

regards,
Marcos

rafalk




msg:486645
 8:08 pm on Oct 16, 2002 (gmt 0)

Neither Bookmarks nor Test is included in the RDF dump anymore. Bookmarks are accessible from the DMOZ.org main page, while Test is no longer visible to non-editors.

There are archived versions out there on the web, so you can go see for yourselves.

g1smd




msg:486646
 9:04 pm on Oct 18, 2002 (gmt 0)


Many of the /World categories (i.e. the non-English ones) are gradually being converted from various character set encodings such as ISO-8859-7, ISO-2022, Shift-JIS, BIG-5, and others to a uniform UTF-8 encoding. There are a LOT of software tweaks to take into account: on submission forms, editing screens, public view, and so on. As there is a big possibility of breaking something in transit, each category is duplicated in a hidden category for a few weeks while editors try all the features. When all is found to be in order, the public category is replaced with a newly regenerated and converted version, and the hidden category is deleted.

At one point there were several dozen languages going through testing, all in parallel. So, several months ago, the directory numbers appeared to receive a boost. Now that the work has been done in many categories, the removal of these duplicates sees the directory number shrink back to where it was, except for the addition of about 4000 completely new sites to the whole directory each week in the meantime.
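
The conversion step itself, from a legacy charset to UTF-8, can be sketched in a few lines. This is not the ODP's actual tooling, just a minimal illustration using Python's standard codecs; the helper name and the sample strings are assumptions.

```python
def to_utf8(raw: bytes, source_encoding: str) -> bytes:
    """Decode bytes from a legacy charset and re-encode them as UTF-8."""
    return raw.decode(source_encoding).encode("utf-8")


# Hypothetical category titles stored in two of the legacy encodings.
greek_title = "Ελλάδα"
japanese_title = "日本語"
legacy = [
    (greek_title.encode("iso-8859-7"), "iso-8859-7", greek_title),
    (japanese_title.encode("shift_jis"), "shift_jis", japanese_title),
]

for raw, encoding, expected in legacy:
    converted = to_utf8(raw, encoding)
    # Round-trip check: the converted bytes must read back as the same text.
    assert converted.decode("utf-8") == expected
    print(encoding, "->", converted.decode("utf-8"))
```

Decoding with the wrong source encoding either raises an error or silently produces the wrong characters, which is one reason a trial period in a duplicate category is a sensible precaution.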

Marcos




msg:486647
 1:18 am on Oct 19, 2002 (gmt 0)

Thanks g1smd, but I'm not sure I'm following you. Was it a counting error, then? While the public pages keep showing the appropriate charset, which may confuse a spider, everything in the dumps has been correctly formatted as UTF-8 since 1999.
To count the links correctly, you just need to read them from the content.rdf file, and maybe add the few thousand per week added since the last RDF update. But a 450,000-site error seems a little out of order.
Someone pointed out that ODP has just "cleaned out" the non-existent sites, but the truth is that most of them are still there, many as redirects to link farms.

Anyhow, 3.3 million manually edited pages is more than enough, and the work done by ODP editors and DMOZ staff is no doubt very impressive.
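
The size of the gap can be checked with the figures quoted in this thread. The only number below that does not come from the thread is the age of the dump, assumed here to be about four weeks because the RDF is said to be updated monthly.

```python
claimed = 3_800_000      # front-page figure (described above as an estimate)
found = 3_347_837        # listings counted in the dump
new_per_week = 4_000     # growth rate quoted in the thread
weeks_since_dump = 4     # assumption: the dump is about a month old

gap = claimed - found
growth = new_per_week * weeks_since_dump
unexplained = gap - growth
weeks_needed = gap / new_per_week

print(gap)                  # 452163
print(growth)               # 16000
print(unexplained)          # 436163
print(round(weeks_needed))  # 113
```

At 4000 new sites a week, it would take roughly 113 weeks of additions to account for the whole difference, so a month of growth since the last dump explains only a small part of it.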
