I am new to ODP and I want to write something that parses and stores data from both files, structure.rdf.u8.gz and content.rdf.u8.gz, into either MySQL or MS SQL. However, I am having a heck of a time understanding what certain tags are used for in relation to rendering the dmoz directories (i.e. in the structure file: narrow, narrow1, narrow2, symbolic, symbolic1, symbolic2; and in the content file: link, link1, mediadate, priority and ages).
Please help out as much as you can. This stuff is not documented at all and I have not seen sample code or articles online. If anyone can help, that would be greatly appreciated.
Thanks!
These sub categories and @links are sorted into 3 levels with 2 being highest, 1 being the middle, and no number being the lowest. If you look at most top level categories you can see what I mean.
So in the RDF
narrow2 sorts into the top group, narrow1 into the next, and narrow into the bottom.
symbolic2, symbolic1, and symbolic then sort similarly.
Note: narrow and symbolic tags are sorted together - so the top group contains all the narrow2 AND symbolic2 links.
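A rough sketch of that grouping in Python, in case it helps. It only looks at the tag names; the sample entries are made up, pulling the tags out of structure.rdf is left out, and the alphabetical ordering inside each group is an assumption rather than something confirmed in this thread.

```python
import re

# Trailing digit gives the level: narrow2/symbolic2 -> 2,
# narrow1/symbolic1 -> 1, plain narrow/symbolic -> 0.
TAG_RE = re.compile(r"^(narrow|symbolic)([12]?)$")

def sort_level(tag):
    """Return 2, 1 or 0 for a narrow*/symbolic* tag name, None for other tags."""
    m = TAG_RE.match(tag)
    if m is None:
        return None
    return int(m.group(2) or 0)

def group_children(entries):
    """Group (tag, target) pairs by level, highest level first.

    narrow and symbolic entries share a group. Ordering inside a
    group (alphabetical by target) is an assumption.
    """
    groups = {2: [], 1: [], 0: []}
    for tag, target in entries:
        level = sort_level(tag)
        if level is not None:
            groups[level].append((tag, target))
    return [sorted(groups[lvl], key=lambda e: e[1]) for lvl in (2, 1, 0)]

# Hypothetical entries, not taken from the real RDF.
sample = [
    ("narrow", "Top/Example/Zebra"),
    ("symbolic2", "Top/Example/Alpha"),
    ("narrow1", "Top/Example/Middle"),
    ("narrow2", "Top/Example/Beta"),
    ("symbolic", "Top/Example/Apple"),
]
top, middle, bottom = group_children(sample)
```

Here top holds the narrow2 and symbolic2 entries together, middle the narrow1 and symbolic1 entries, and bottom the unnumbered ones.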
Regarding the others, mediadate will only appear if a URL has a mediadate set, which would be for a copy of an article in a newspaper, or similar. Ages is likewise a tag that only appears where one has been set in the directory; it indicates the age of the children the site is suitable for (used only in Kids and Teens).
Not sure about the others without more research.
[dmoz.org...] is now quite dated (Jan 1999), but it shows the top structure.
Ciao
In this example, you want the Top/Regional/Europe/Ireland slice. That has a start-of-Topic header of
'<Topic r:id="Top/Regional/Europe/Ireland'. Only part of the RDF tag is used as a trigger. The following is written in pseudo code.
Start:
$topicstub = "<Topic r:id="
$topic = "<Topic r:id=\"Top/Regional/Europe/Ireland"
trigger = 0
for each $line read from content.rdf:
    if $topic is in $line { trigger = 1 }            # start of a wanted Topic
    else if $topicstub is in $line { trigger = 0 }   # a different Topic: end of segment detected
    if trigger = 1 { write $line to the output file }
---
Roughly, you read the content or structure rdf until you get to the start of the topic you require. If the start of topic is detected, you begin to write the lines to your output file. If the end of the topic (a topic other than the one you require) is detected stop writing to the output file.
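For anyone who would rather start from something runnable, here is a minimal Python sketch of the same trigger logic. It is not the TCL code I used; the function names and the output file name are made up for illustration, and it assumes each Topic header sits on a single line of the dump.

```python
import gzip

TOPIC_STUB = '<Topic r:id="'

def slice_topic(lines, prefix):
    """Yield the lines belonging to Topics whose id starts with `prefix`."""
    wanted = TOPIC_STUB + prefix
    writing = False
    for line in lines:
        if TOPIC_STUB in line:
            # Every Topic header either starts or ends a wanted segment.
            writing = wanted in line
        if writing:
            yield line

def slice_file(src_path, dst_path, prefix):
    """Stream a gzipped RDF dump and write the matching slice as plain text."""
    with gzip.open(src_path, "rt", encoding="utf-8", errors="replace") as src, \
         open(dst_path, "w", encoding="utf-8") as dst:
        dst.writelines(slice_topic(src, prefix))

# Example call (hypothetical output name):
# slice_file("content.rdf.u8.gz", "ireland.rdf", "Top/Regional/Europe/Ireland")
```

Because only the start of the id is matched, subcategories are kept as well. The same prefix would also match a sibling whose name merely begins with "Ireland", so the output may need a second pass if that matters.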
Transferring the resulting RDF to SQL is not necessarily that difficult as it is essentially just parsing text. Some error conditions have to be detected, such as where sloppy editing puts a \ instead of a / or an @ in a URL.
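A small sketch of what that checking might look like, again in Python. Everything here is an assumption for illustration: the table names, the decision to repair backslashes but set aside URLs containing an @, and the use of the standard sqlite3 module as a stand-in for MySQL or MS SQL. It starts from (topic, url) pairs that have already been pulled out of the RDF.

```python
import sqlite3

def check_url(url):
    """Return (cleaned_url, problems) for one link URL."""
    problems = []
    if "\\" in url:
        # A backslash where a forward slash was meant: repair it.
        problems.append("backslash")
        url = url.replace("\\", "/")
    if "@" in url:
        # Flag only; deciding what to do with these is left to a person.
        problems.append("at-sign")
    return url, problems

def load_links(conn, rows):
    """Insert (topic, url) pairs; URLs containing '@' go to a review table."""
    conn.execute("CREATE TABLE IF NOT EXISTS links (topic TEXT, url TEXT)")
    conn.execute(
        "CREATE TABLE IF NOT EXISTS rejects (topic TEXT, url TEXT, reason TEXT)"
    )
    for topic, url in rows:
        url, problems = check_url(url)
        if "at-sign" in problems:
            conn.execute(
                "INSERT INTO rejects VALUES (?, ?, ?)",
                (topic, url, ",".join(problems)),
            )
        else:
            conn.execute("INSERT INTO links VALUES (?, ?)", (topic, url))
    conn.commit()

# Hypothetical rows, not taken from the real dump.
conn = sqlite3.connect(":memory:")
load_links(conn, [
    ("Top/Example", "http://www.example.com\\page.html"),
    ("Top/Example", "http://user@www.example.org/"),
])
```

The parameterised INSERT statements matter more than the checks themselves, since titles and descriptions in the dump contain quotes that would break hand-built SQL strings.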
Some examples of code have been given in this thread:
[webmasterworld.com...]
I think that there is a Python links directory program that uses the Dmoz RDFs as well.
[oedipus.sourceforge.net...]
However this program is no longer being maintained.
I needed to slice and dice to produce an Ireland directory from the Dmoz data so my solution was more page or Topic orientated and used a number of tables to regenerate each Topic as a set of extremely simple static webpages (I think that there are around 1000 pages for Ireland with something in the region of 7785 links). There is a tendency for people to try and dump the RDFs into two SQL tables (content and structure). However I did not use this approach as I needed a finer granularity on the data, as I had planned on incorporating a deadheader (404 checker) spider along with other content. I wrote the rippers, parsers and page generator in TCL (some very gruesome code ;) ) as this is the language I use for writing spiders.
As far as I can remember, there are some Perl versions of Dmoz type directory programs around. One of them had a downright intense table structure the last time I looked at it (Catalog - [freesoftware.fsf.org...] ). Dmoz itself has some good links on software that uses the RDFs with a MySQL backend:
[dmoz.org...]
While writing the rippers/parsers and working out a good structure are not impossible, it is important to have an objective in mind. If you only want to mess around with the data then it probably would be simpler to use some of the existing programs outlined above. Some of these have input modules as well and as such you could essentially set one of these programs up on your PC and write a client in PHP or something to access the data. Indeed it would probably be a lot easier than reinventing the wheel.
Regards...jmcc