I can't find off-hand how long it should take.
There is a publicly available status page you could peek at once in a while.
[dmoz.org...]
Practice patience with this page. It may take a while to appear.
Unfortunately I don't know what it all means or how you can gauge how much longer it'll take. Perhaps someone else can explain that.
Right now, the Status reads "Success", which I think means it hasn't found any problems yet.
The metaphorical bottom line is the "RDFs not pushed". When the "not" vanishes that's a brand-new RDF up for grabs.
They should do a simple csv flat file. It could be generated on-the-fly out of the real db and eliminate this monthly mess.
It would also help promote the ODP since it would be in a file format that most could use and it would be 20-40% smaller in size.
There appears to be one (exactly one) staff member working on the RDF problem, and it's not her only task even if it is a priority.
So it doesn't seem likely that she's going to switch away from the bug-fixing to designing a whole other approach.
But if someone could submit code that spiders the ODP and produces a flatter, smaller, less-likely-to-crash file, that'd be a benefit all round.
Whether the ODP would run it on the Web's behalf, I wouldn't know. But we can hope :)
It wouldn't be so useful run from the outside, as it could not be sure that categories aren't being renamed as it runs. (The RDF dump process suspends cat renames for this reason, which is also why a faster dump process will make for happier editors: less time to wait for reorganizations.)
Has anyone made a start on something like this?
To transfer that db around the intranet for rebuilding on other disks, I generate a flat file on the fly with Perl.
The flat file is simple:
@cat¦fully qualified category (eg: the relative to root disk directory path)
url¦title¦description¦misc
url¦title¦description¦misc
All I have to look for to reconstitute it elsewhere is the @cat line. Everything up to the next @cat line is an entry in that category.
That is done on the fly as fast as Perl can generate it. It takes roughly 25-30 minutes (a guess) over Ethernet to transfer the entire db and reconstitute it 100% on the fly, from about 100 lines of cheesy Perl script.
Doing the same thing through RDF would take 8-10 hours to build the RDF and 3-4 hours to rip it back apart.
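For anyone who wants to try the flat format described above, here is a minimal sketch in Python (the original script was Perl and is not shown in this thread, so this is a stand-in, not that script). The function names `write_flat`, `read_flat` and `roundtrip` are made up for illustration, and it assumes no field ever contains the `¦` delimiter or a newline.

```python
import io

SEP = "\u00a6"  # the broken-bar delimiter (¦) used in the format above


def write_flat(categories, out):
    """Write {category_path: [(url, title, description, misc), ...]} as flat text."""
    for cat, entries in categories.items():
        out.write(f"@cat{SEP}{cat}\n")
        for entry in entries:
            out.write(SEP.join(entry) + "\n")


def read_flat(lines):
    """Rebuild the mapping: every line up to the next @cat line belongs to that category."""
    categories = {}
    current = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@cat" + SEP):
            current = line.split(SEP, 1)[1]
            categories[current] = []
        elif current is not None:
            # Assumption: exactly four fields per entry, none containing SEP.
            url, title, description, misc = line.split(SEP, 3)
            categories[current].append((url, title, description, misc))
    return categories


def roundtrip(categories):
    """Write to an in-memory buffer and read it straight back."""
    buf = io.StringIO()
    write_flat(categories, buf)
    buf.seek(0)
    return read_flat(buf)
```

Both directions are a single pass with no lookahead, which is what makes streaming it over the network practical. A real version would need some escaping rule for the delimiter.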
Dmoz is going to continue to have these problems, and they are going to get worse, not better. It won't get better until they dump the processor-intensive, bloated RDF format.
Taking the dump/diff model a step further, it could be possible to split the Dmoz RDFs into categories/major trees for downloading.
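A rough sketch of what splitting by major tree could look like, in Python. It works on the `@cat¦...` flat format posted earlier in this thread rather than on the RDF itself, which is an assumption made to keep the example short. `split_by_tree` and its `depth` parameter are hypothetical names.

```python
from collections import defaultdict

SEP = "\u00a6"  # broken-bar delimiter (¦) from the flat format in this thread


def split_by_tree(lines, depth=2):
    """Group flat-file lines by the first `depth` components of the category path.

    With depth=2, "Top/Arts/Music" and "Top/Arts" both land under "Top/Arts".
    """
    trees = defaultdict(list)
    key = None
    for line in lines:
        line = line.rstrip("\n")
        if not line:
            continue
        if line.startswith("@cat" + SEP):
            path = line.split(SEP, 1)[1]
            key = "/".join(path.split("/")[:depth])
        if key is not None:
            # Lines before the first @cat line have no category and are skipped.
            trees[key].append(line)
    return dict(trees)
```

Each value could then be written to its own file, so someone who only needs one tree downloads only that piece.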
Regards...jmcc
Don't mix up parsing the RDF with generating it. While generating, you pretty much just write a lot of text into a file. When parsing, you have to search for keywords.
What takes so long while parsing an RDF is checking the keywords, not reading all the text.
Parsing is essentially checking each line (the material between the tags as well as the tags). I am not sure how keywords come into this. The Dmoz (RDF->SQL) parser/converter I wrote took each line and converted the content and structure RDFs into a number of MySQL tables. Each line of text had to be read because it had to be analysed. The keywords aspect applies to the Dmoz search rather than to the RDF dumps. (Unless you are using the terms 'tag' and 'keyword' interchangeably.)
Dmoz's problem (from where I stand) is that they are generating the RDFs from a dodgy database and the error checking is picking up these errors. However, the errors in the latest dump seem to be concentrated at the extremes of the catids. This may be a sign that things are getting closer to a resolution. :)
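To illustrate the line-by-line approach, here is a minimal sketch in Python. It is not the converter mentioned above: SQLite stands in for MySQL so the example is self-contained, and it assumes a simplified content layout with one tag per line, using the element names `ExternalPage`, `d:Title`, `d:Description` and `topic`. Treat those names, the single `pages` table and the function `load_rdf` as assumptions for illustration.

```python
import re
import sqlite3
from xml.sax.saxutils import unescape

# Assumed, simplified layout: one tag per line inside each <ExternalPage> block.
PAGE_RE = re.compile(r'<ExternalPage about="([^"]*)">')
FIELD_RE = re.compile(r"<(d:Title|d:Description|topic)>(.*?)</\1>")
QUOT = {"&quot;": '"'}  # unescape() handles &amp; &lt; &gt; on its own


def load_rdf(lines, conn):
    """Read RDF-like text one line at a time and insert one row per page."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(url TEXT, title TEXT, description TEXT, topic TEXT)"
    )
    page = None
    for raw in lines:
        line = raw.strip()
        m = PAGE_RE.match(line)
        if m:
            page = {"url": unescape(m.group(1), QUOT),
                    "d:Title": "", "d:Description": "", "topic": ""}
            continue
        if page is None:
            continue  # outside an <ExternalPage> block: nothing to record
        m = FIELD_RE.match(line)
        if m:
            page[m.group(1)] = unescape(m.group(2), QUOT)
        elif line == "</ExternalPage>":
            conn.execute(
                "INSERT INTO pages VALUES (?, ?, ?, ?)",
                (page["url"], page["d:Title"],
                 page["d:Description"], page["topic"]),
            )
            page = None
    conn.commit()
```

Every line gets read and matched against a pattern, whereas generating the file is just formatting and writing. That is where the extra parsing time comes from.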
Regards...jmcc