RDF Update - Status - (deprecated) Directories forum at WebmasterWorld

RDF Update - Status

HuhuFruFru

3:55 pm on Jan 18, 2003 (gmt 0)

we know that many experienced editors from the odp are members here at webmasterworld. it would be very nice if they could keep us updated about any news and the current status of the RDF. thank you! :)

victor

4:29 pm on Jan 18, 2003 (gmt 0)

The latest extract was started on 2003-01-14. No further messages posted about its status. That (I hope) means it is still running and hasn't fallen over or been forced to restart.

I can't find off-hand how long it should take.

There is a publicly available status page you could peek at once in a while.

[dmoz.org...]

Practice patience with this page. It may take a while to appear.

Unfortunately I don't know what it all means or how you can gauge how much longer it'll take. Perhaps someone else can explain that.

Right now, the Status reads "Success" which I think means it hasn't found any problems yet.

The metaphorical bottom line is the "RDFs not pushed". When the "not" vanishes that's a brand-new RDF up for grabs.

crunchy cajun

4:37 pm on Jan 18, 2003 (gmt 0)

Argh, some landlubber beat me to the post.

victor

4:55 pm on Jan 18, 2003 (gmt 0)

Argh, some landlubber beat me to the post.

As the manager from The Office (a Brit TV show) once said: "Age and treachery will beat youth and ability every time" :)

g1smd

7:36 pm on Jan 18, 2003 (gmt 0)

An RDF dump usually takes about 5 days to complete, whether or not it is successful. Sometimes other processes can slow it down a little, and sometimes it exits with an error without completing. Just keep an eye on that page above, as it shows the status of the last RDF that completed, and whether it is OK or not.

HuhuFruFru

11:33 am on Jan 20, 2003 (gmt 0)

there are new tags:

[dmoz.org...]

is this good or bad?

jonrichd

11:38 am on Jan 20, 2003 (gmt 0)

Unfortunately the tags were not pushed, so they will have to correct errors and try again.

Brett_Tabke

11:45 am on Jan 20, 2003 (gmt 0)

Dmoz is going to continue to have these problems and they are going to get worse, not better. It won't get better until they dump the processor intensive bloated RDF format.

They should do a simple csv flat file. It could be generated on-the-fly out of the real db and eliminate this monthly mess.

It would also help promote the ODP since it would be in a file format that most could use and it would be 20-40% smaller in size.

victor

1:53 pm on Jan 20, 2003 (gmt 0)

Great idea Brett!

There appears to be one (exactly one) staff member working on the RDF problem, and it's not her only task even if it is a priority.

So it doesn't seem likely that she's going to switch away from the bug-fixing to designing a whole other approach.

But if someone could submit the code that spiders the ODP and produces a flatter, smaller, less-likely-to-crash, file that'd be a benefit all round.

Whether the ODP would run it on the Web's behalf, I wouldn't know. But we can hope :)

It wouldn't be so useful run from the outside as it could not be sure that categories aren't being renamed as it runs. (The RDF dump process suspends cat renames for this reason. Which is a reason why a faster dump process will make for happier editors -- less time to wait for reorganizations).

Has anyone made a start on something like this?

HuhuFruFru

2:39 pm on Jan 20, 2003 (gmt 0)

deleted (internal odp material)

[edited by: HuhuFruFru at 4:44 pm (utc) on Jan. 20, 2003]

Brett_Tabke

2:48 pm on Jan 20, 2003 (gmt 0)

I have a fairly faithful copy of the ODP here on local disk. It was generated out of the rdf dump about 6 months ago. It is basically the odp reconstituted as faithfully as I could.

To transfer that db around the intranet for rebuilding on other disks, I generate a flat file on-the-fly from perl.

The flat file is simple:
@cat�fully qualified category (eg: the relative to root disk directory path)
url�title�description�misc
url�title�description�misc

All I have to look for to reconstitute it elsewhere is the @cat line. Up to the next @cat line are the entries in the db.

That is done on the fly as fast as perl can generate it. Takes about (guess) 25-30mins over ethernet to xfer the entire db and reconstitute it 100% on the fly out of about 100 lines of cheesy perl script.

To do the same thing out of rdf, would take 8-10hrs to build the rdf and 3-4 to rip it back apart.
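For anyone who wants to try the idea, here is a minimal sketch of that kind of flat file in Python rather than perl. It is not the actual script: the tab separator, the function names and the sample data are all assumptions made for illustration, and it assumes no field contains a tab or a newline.

```python
import io

SEP = "\t"  # assumed field separator, chosen only for this sketch

def write_flat(categories, out):
    """Write {category_path: [(url, title, description, misc), ...]} as a flat file."""
    for path, entries in categories.items():
        out.write(f"@cat{SEP}{path}\n")          # one marker line per category
        for entry in entries:
            out.write(SEP.join(entry) + "\n")    # one line per listing

def read_flat(lines):
    """Rebuild the mapping: every line up to the next @cat line belongs to the current category."""
    categories = {}
    current = None
    for raw in lines:
        line = raw.rstrip("\n")
        if not line:
            continue
        if line.startswith("@cat" + SEP):
            current = line.split(SEP, 1)[1]
            categories[current] = []
        elif current is not None:
            categories[current].append(tuple(line.split(SEP)))
    return categories

# Hypothetical sample data, not taken from the ODP.
sample = {
    "Top/Computers/Internet": [
        ("http://example.com/", "Example", "An example site", ""),
    ],
    "Top/Arts": [],
}
buf = io.StringIO()
write_flat(sample, buf)
buf.seek(0)
restored = read_flat(buf)
```

The reader only ever needs to recognise one marker, which is why this kind of format is cheap to generate and to take apart again.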

HuhuFruFru

3:12 pm on Jan 20, 2003 (gmt 0)

deleted (internal odp material)

[edited by: HuhuFruFru at 4:44 pm (utc) on Jan. 20, 2003]

windharp

3:39 pm on Jan 20, 2003 (gmt 0)

Don't mix up parsing the RDF with generating it. While generating, you pretty much just write a lot of text into a file. When parsing you have to search for keywords.

What takes long while parsing an RDF is checking the keywords, not reading all the text.

g1smd

8:56 pm on Jan 20, 2003 (gmt 0)

There were some hardware changes and some downtime during the last RDF and it had to be restarted (as reported on several fora at the time). From the errors list at [dmoz.org...] it looks like there were a few knock-on effects that caused the latest failure. I'm guessing the RDF will be restarted in a day or two (this has been the way it has been done in the past), and then it is another 5 day wait to see what pops out.

jmccormac

11:00 pm on Jan 21, 2003 (gmt 0)

Dmoz is going to continue to have these problems and they are going to get worse, not better. It won't get better until they dump the processor intensive bloated RDF format.

Getting rid of the processor intensive RDF format would not be a good thing. Unless the csv replacement had some kind of checksum, it could induce more errors. With RDF, the data has to be legal. (Though with some entries, the data is legal but the URL is completely wrong.) Processing the complete RDF is highly processor intensive and it really should only have to be done once for each directory. The more logical solution would be to have a complete (monthly/quarterly) RDF with a weekly diff of data that has changed. Integrating a difference file of sites and entries that were changed would not necessarily be that difficult. However it may well break some of the existing Dmoz dependent software. One of the main arguments in favour of RDF over other file types is that it is far easier to enclose other characters such as ,¦\/#@ etc.
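As a rough illustration of the checksum point, a CSV dump can be shipped with a digest so that a truncated or corrupted download is rejected before it is loaded. This is only a sketch under assumptions: the function names and sample rows are hypothetical, SHA-256 is just one possible choice of checksum, and Python's standard csv module is used to quote awkward characters.

```python
import csv
import hashlib
import io

def dump_csv(rows):
    """Serialise rows to CSV text; fields with commas, quotes or newlines get quoted."""
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_MINIMAL, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()

def digest(text):
    """Checksum published alongside the dump (SHA-256 is an assumption)."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def load_csv(text, expected_digest):
    """Refuse to parse a dump whose checksum does not match."""
    if digest(text) != expected_digest:
        raise ValueError("checksum mismatch: dump is truncated or corrupted")
    return [tuple(row) for row in csv.reader(io.StringIO(text))]

# Hypothetical rows containing the awkward characters mentioned above.
rows = [
    ("http://example.com/", "Title, with comma", 'Chars: ,¦\\/#@ and "quotes"'),
    ("http://example.org/", "Plain title", "Plain description"),
]
text = dump_csv(rows)
checksum = digest(text)
restored = load_csv(text, checksum)
```

A checksum catches damage in transit, but it does not make the data itself valid, so it complements rather than replaces whatever validation the dump process does.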

They should do a simple csv flat file. It could be generated on-the-fly out of the real db and eliminate this monthly mess.

I am not sure about the method that Dmoz uses to generate the dumps. However if the dump is being done from an active database then there are some very interesting problems. The duplicate catids seem to indicate that the catid is not unique. If this is so, then there is a very serious flaw in the structure of Dmoz and it is basically creating a lot of the problems that have been seen so far. Thus starting with what may be bad data is not the best way to proceed. Providing CSV dumps/diffs would still not solve the problem of duplicate catids. It looks like Dmoz is generating from a number of sub Dmoz directories rather than one single database. This is the only explanation that I can think of for the duplicate catid problem (apart from a banjaxed schema).
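Checking a dump for this is straightforward once the (catid, category) pairs have been extracted. A minimal sketch, assuming the pairs are already available as tuples; the function name and the sample values are hypothetical:

```python
from collections import defaultdict

def find_duplicate_catids(pairs):
    """Return {catid: sorted category paths} for every catid used by more than one path."""
    seen = defaultdict(set)
    for catid, path in pairs:
        seen[catid].add(path)
    return {catid: sorted(paths) for catid, paths in seen.items() if len(paths) > 1}

# Hypothetical sample, not real ODP data.
sample = [
    (2, "Top/Arts"),
    (3, "Top/Business"),
    (2, "Top/Arts/Music"),  # same catid, different category: a clash
    (3, "Top/Business"),    # same pair repeated: not a clash
]
duplicates = find_duplicate_catids(sample)
```

In a relational database the same guarantee would normally come from a unique constraint on the catid column, which is why duplicates point towards either several source databases or a schema problem.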

It would also help promote the ODP since it would be in a file format that most could use and it would be 20-40% smaller in size.

The dump/diff model would be a lot more efficient as it would mean that many smaller sites would be able to update their Dmoz databases on a far more regular basis and would not have to download the content/structure RDFs, parse them and integrate them into their local databases. The diff files could be made available in csv format or in RDF or XML.
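A rough sketch of what the dump/diff model could look like, assuming each listing is keyed by (category, url) and carries a (title, description) pair. The data layout and function names are assumptions for illustration, not anything Dmoz provides:

```python
def make_diff(old, new):
    """Compare two snapshots of {(category, url): (title, description)}."""
    added = {key: new[key] for key in new.keys() - old.keys()}
    removed = sorted(old.keys() - new.keys())
    changed = {key: new[key] for key in new.keys() & old.keys() if new[key] != old[key]}
    return {"added": added, "removed": removed, "changed": changed}

def apply_diff(old, diff):
    """Bring a local copy up to date without reloading the full dump."""
    result = dict(old)
    for key in diff["removed"]:
        result.pop(key, None)
    result.update(diff["added"])
    result.update(diff["changed"])
    return result

# Hypothetical snapshots.
old = {
    ("Top/Arts", "http://a.example/"): ("A", "Site A"),
    ("Top/Arts", "http://b.example/"): ("B", "Site B"),
}
new = {
    ("Top/Arts", "http://a.example/"): ("A", "Site A, updated"),
    ("Top/Science", "http://c.example/"): ("C", "Site C"),
}
diff = make_diff(old, new)
updated = apply_diff(old, diff)
```

The diff only contains what changed between the two snapshots, so a small site would download and process a fraction of the full content/structure files.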

Taking the dump/diff model a step further, it could be possible to split the Dmoz RDFs into categories/major trees for downloading.
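Splitting by major tree could be as simple as grouping entries on the leading components of the category path. A small sketch, with a hypothetical function name, sample data and default depth:

```python
from collections import defaultdict

def split_by_tree(entries, depth=2):
    """Group (category_path, url) pairs by the first `depth` path components."""
    trees = defaultdict(list)
    for path, url in entries:
        key = "/".join(path.split("/")[:depth])
        trees[key].append((path, url))
    return dict(trees)

# Hypothetical entries.
entries = [
    ("Top/Arts/Music", "http://a.example/"),
    ("Top/Arts", "http://b.example/"),
    ("Top/Science/Physics", "http://c.example/"),
]
trees = split_by_tree(entries)
```

Each group could then be written out as its own, much smaller, download.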

Regards...jmcc

jmccormac

11:15 pm on Jan 21, 2003 (gmt 0)

Don't mix up parsing the RDF with generating it. While generating, you pretty much just write a lot of text into a file. When parsing you have to search for keywords.
What takes long while parsing an RDF is checking the keywords, not reading all the text.

Parsing is essentially checking each line (the material between the tags as well as the tags). I am not sure how keywords come into this. In terms of the Dmoz (RDF->SQL) parser/converter I wrote, it was to take each line and convert the content and structure RDFs into a number of MySQL tables. Each line of text had to be read because it had to be analysed. The keywords aspect applies to the Dmoz search rather than to the RDF dumps. (Unless you are using the terms 'tag' and 'keyword' interchangeably.)
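To give a feel for the RDF->SQL step, here is a small sketch. It is not the converter described above: it swaps in a streaming XML parser instead of line-by-line handling, uses SQLite as a stand-in for MySQL, and runs on a simplified, hypothetical sample rather than the real Dmoz schema. The table and column names are also assumptions.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Simplified stand-in for a dump; the real files use namespaces and more fields.
SAMPLE = """<RDF>
  <Topic id="Top/Arts"><catid>2</catid></Topic>
  <ExternalPage about="http://example.com/">
    <Title>Example</Title>
    <Description>An example site</Description>
    <topic>Top/Arts</topic>
  </ExternalPage>
</RDF>"""

def load(xml_text, conn):
    """Stream the document and insert one row per completed element."""
    conn.execute("CREATE TABLE categories (catid INTEGER, path TEXT)")
    conn.execute("CREATE TABLE pages (url TEXT, title TEXT, description TEXT, path TEXT)")
    source = io.BytesIO(xml_text.encode("utf-8"))
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Topic":
            conn.execute("INSERT INTO categories VALUES (?, ?)",
                         (int(elem.findtext("catid")), elem.get("id")))
            elem.clear()  # free memory once the element has been stored
        elif elem.tag == "ExternalPage":
            conn.execute("INSERT INTO pages VALUES (?, ?, ?, ?)",
                         (elem.get("about"), elem.findtext("Title"),
                          elem.findtext("Description"), elem.findtext("topic")))
            elem.clear()
    conn.commit()

conn = sqlite3.connect(":memory:")
load(SAMPLE, conn)
categories = conn.execute("SELECT catid, path FROM categories").fetchall()
pages = conn.execute("SELECT url, title, description, path FROM pages").fetchall()
```

Either way every byte of the dump has to be read and examined, which is the point being made here: the cost is in analysing the whole file, not in any keyword lookup.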

Dmoz's problem (from where I stand) is that they are generating the RDFs from a dodgy database and the error checking is picking up these errors. However the errors in the latest dump seem to be concentrated at the extremes of the catids. This may be a sign that things are getting closer to a resolution. :)

Regards...jmcc