how to convert dmoz format files to database format files

"well structured dmoz database" development group exists?

cisnet

10:45 am on Jun 17, 2003 (gmt 0)

Hi,
I want to know if it's possible to process the DMOZ *.rdf files and save them into a database. I know there are some Perl scripts that can parse these files, but that is not what I'm looking for.
The target can be SQL Server, Access, or any other popular database file format.

If nobody knows how to do it, I'd like to join some kind of group that is trying to work out a well-structured DMOZ database schema, along with all the scripts involved.

Thanks

[edited by: Marcia at 6:42 am (utc) on June 18, 2003]
[edit reason] No emails, please. [/edit]

g1smd

9:35 am on Jul 13, 2003 (gmt 0)

See [rodan.ncc.com...] and pages linked from it for more useful information.

jmccormac

9:56 am on Jul 19, 2003 (gmt 0)

It is not that difficult to do once you get your head around the ODP structure. (Be prepared for some LD50 coffee sessions though, as some of it does not make sense initially.) Unfortunately, I don't think I've seen much in the way of explanations of the ODP structure. (It should be obvious from looking at a Dmoz webpage and its structure, but linking that with the RDF takes a bit of doing.) However, the ODP data is in RDF format, which means that it can easily be parsed into SQL statements when it is error-free.
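As a rough illustration of the RDF-to-SQL step, here is a minimal Python sketch (Python rather than the Perl or Tcl mentioned in this thread). It assumes the content dump stores each listing as an ExternalPage block with d:Title, d:Description and topic children, one element per line. The table name `pages`, the function names and the sample record are made up for the example. It reads line by line with regular expressions instead of a strict XML parser, on the assumption that the dump is not always well-formed.

```python
import html
import re
import sqlite3

# Assumed layout of one listing in the content dump (not verified against a live file).
PAGE_OPEN = re.compile(r'<ExternalPage about="([^"]*)">')
FIELD = re.compile(r"<(d:Title|d:Description|topic)>(.*?)</\1>")


def parse_pages(lines):
    """Yield (url, title, description, topic) for each ExternalPage block."""
    record = None
    for line in lines:
        opened = PAGE_OPEN.search(line)
        if opened:
            record = {"url": html.unescape(opened.group(1))}
        elif record is not None and "</ExternalPage>" in line:
            yield (
                record["url"],
                record.get("d:Title", ""),
                record.get("d:Description", ""),
                record.get("topic", ""),
            )
            record = None
        elif record is not None:
            field = FIELD.search(line)
            if field:
                record[field.group(1)] = html.unescape(field.group(2))


def load_pages(conn, lines):
    """Create the hypothetical `pages` table and insert every parsed listing."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS pages "
        "(url TEXT, title TEXT, description TEXT, topic TEXT)"
    )
    # Parameterised inserts avoid quoting problems in titles and descriptions.
    conn.executemany("INSERT INTO pages VALUES (?, ?, ?, ?)", parse_pages(lines))
    conn.commit()


# Made-up sample record, for illustration only.
SAMPLE = """\
<ExternalPage about="http://www.example.ie/?a=1&amp;b=2">
  <d:Title>Example &amp; Co.</d:Title>
  <d:Description>A made-up listing.</d:Description>
  <topic>Top/Regional/Europe/Ireland</topic>
</ExternalPage>
"""

conn = sqlite3.connect(":memory:")
load_pages(conn, SAMPLE.splitlines())
rows = conn.execute("SELECT url, title, topic FROM pages").fetchall()
```

SQLite is used here only because it ships with Python. The same tuples could be fed to any other database driver, with the CREATE TABLE statement adjusted to suit.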

The process of generating SQL from the RDF files is time-consuming in computer terms, as it is all text parsing. Hence the popularity of ODP software that pulls its results live from the ODP rather than from a local copy. Processing the data locally may sound nice, but remember you are dealing with an RDF of about 1.2GB for content and about 500MB for structure. Downloading and processing files of this magnitude on a regular basis is not an easy task for the ordinary website operator. The database footprint of the complete ODP would be in the region of a few gigabytes, and as a result some people tend to use only small parts of the RDFs.
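With files of that size, the dump is better read as a stream than loaded into memory in one go. A minimal Python sketch follows, assuming the dump is a UTF-8 text file that may or may not still be gzip-compressed. The function names are invented for the example.

```python
import gzip


def iter_dump_lines(path):
    """Yield the dump one line at a time so memory use stays flat."""
    opener = gzip.open if path.endswith(".gz") else open
    # errors="replace" keeps going past the odd badly encoded byte.
    with opener(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")


def count_matching(path, needle):
    """Count lines containing `needle`, e.g. to size a dump before parsing it."""
    return sum(1 for line in iter_dump_lines(path) if needle in line)
```

Calling `count_matching(path, "<ExternalPage ")` on a content dump gives a quick idea of how many listings a full import would have to process, assuming each listing opens with that tag.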

My own interests were in the Irish and UK sections, so it was a simple case of slice and dice to produce smaller starting RDF files. The parsers were then used to produce a set of SQL tables dealing with the content links, categories, editors, news, related categories, structure links, and reviews. Since the end use of the data was for generating static pages, some of the active linking for similar categories and related categories was not implemented on the CMS side. (Actually, now that I think about it, a simpler solution for dealing with related categories outside the subset could be to rewrite the category as an ODP link.) The main db I used for these experiments was MySQL. I used this simply because it was the fastest. Rather than going down the Perl route, I used Tcl, since I write most of my spiders, parsers and CMS programs in that language.
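The slice-and-dice step can be sketched along these lines in Python (not the Tcl actually used). It assumes each listing is an ExternalPage block containing a topic element with the full category path. It handles only those blocks, so the RDF header and the Topic blocks would need similar treatment. The function name and sample data are made up.

```python
import re

TOPIC = re.compile(r"<topic>(.*?)</topic>")


def slice_by_topic(lines, prefix):
    """Yield the lines of every ExternalPage block filed at or below `prefix`."""
    block, inside, keep = [], False, False
    for line in lines:
        if "<ExternalPage " in line:
            block, inside, keep = [], True, False
        if not inside:
            continue
        block.append(line)
        found = TOPIC.search(line)
        if found:
            topic = found.group(1)
            # Match the category itself or any of its subcategories.
            keep = topic == prefix or topic.startswith(prefix + "/")
        if "</ExternalPage>" in line:
            if keep:
                yield from block
            inside = False


# Made-up sample with one Irish listing and one unrelated listing.
SAMPLE = """\
<ExternalPage about="http://www.example.ie/">
  <topic>Top/Regional/Europe/Ireland/Business</topic>
</ExternalPage>
<ExternalPage about="http://www.example.com/">
  <topic>Top/Arts/Music</topic>
</ExternalPage>
""".splitlines()

irish = list(slice_by_topic(SAMPLE, "Top/Regional/Europe/Ireland"))
```

Writing the yielded lines to a new file gives a much smaller RDF to feed to the parsers.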

I've been looking at developing an Irish version of the ODP with a lot of back fill from my search engines (I run the biggest Irish search engine but luckily it is a small country ;) ). The good thing about having a search engine is that it is possible to keep the directory up to date by checking the last returned status for the URL from the search engine. I haven't worked on the code for over a year though.
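The status check could look something like the Python sketch below. It assumes the search engine's spider records the last HTTP status code it received for each URL. The mapping, the function name and the threshold of 400 are assumptions made for the example, not a description of the actual code.

```python
def flag_dead_listings(listing_urls, last_status):
    """Return directory URLs whose last crawl status was an HTTP error (4xx/5xx).

    `last_status` maps URL -> most recent HTTP status seen by the spider.
    URLs the spider has never fetched are left alone.
    """
    dead = []
    for url in listing_urls:
        status = last_status.get(url)
        if status is not None and status >= 400:
            dead.append(url)
    return dead


# Made-up data for illustration.
listings = ["http://a.example.ie/", "http://b.example.ie/", "http://c.example.ie/"]
crawl = {"http://a.example.ie/": 200, "http://b.example.ie/": 404}
dead = flag_dead_listings(listings, crawl)
```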

The idea of a single file format that works across a pile of database types is unusual. Most of them have quirks that would require the SQL to be tweaked; dates, for example, are often handled differently from one database to another.
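To make the date example concrete, here is a small Python sketch that writes the same date as a literal for three databases. The formats shown are commonly accepted ones for each product (quoted ISO date for MySQL, quoted YYYYMMDD for SQL Server, #-delimited US-style date for Access). The dialect names and function name are made up for the example.

```python
from datetime import date

# One formatter per target database; the keys are arbitrary labels.
DATE_LITERALS = {
    "mysql": lambda d: "'" + d.strftime("%Y-%m-%d") + "'",
    "sqlserver": lambda d: "'" + d.strftime("%Y%m%d") + "'",
    "access": lambda d: "#" + d.strftime("%m/%d/%Y") + "#",
}


def date_literal(value, dialect):
    """Render `value` as a date literal for the given database dialect."""
    return DATE_LITERALS[dialect](value)


posted = date(2003, 7, 19)
literals = {name: date_literal(posted, name) for name in DATE_LITERALS}
```

A generator of SQL dump files would need a switch of this kind for every type that differs between targets, which is why a single dump rarely loads everywhere unchanged.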

The link that g1smd suggested is a good starting point. The db.sql there is a bit light for any real work, but it is a good example of what can be done.

Regards...jmcc

g1smd

11:43 pm on Jul 19, 2003 (gmt 0)

There are other files at [rdf.dmoz.org...] besides the RDF dumps themselves.