The RDF is not well-formed XML and no parser will parse it. Some attributes are missing the closing ", probably because of strange encodings for some languages.
Some resources have duplicate entries in the dump.
The dumps are as old as hell. tags.html says 'Rdfs not pushed'. Does that mean they are not dumping because of these problems?
Can anyone please direct me to more info on these topics, or comfort me with some gentle words about a speedy recovery?
Slovenac
As to the closing quotes, are you sure your parser is UTF-8 aware?
The problems with the RDF dump have been one of the hottest ODP-related topics in this forum over the last few weeks. You'll find lots of info here; this thread is one of many:
[webmasterworld.com...]
Do you refer to the newest RDF dump (without catid tags) at [rdf.dmoz.org...] or to the old files at [dmoz.org...]?
As to the closing quotes, are you sure your parser is UTF-8 aware?
I tried Xerces-C++ version 2.2.0 and Expat. Both should be OK with UTF-8, but both fail. I am using the old files from dmoz/rdf.
This is what the Xerces SAXCount utility prints:
Fatal Error at file E:\DMOZ RDF\content.rdf.u8, line 4691225, char 31
Message: Invalid character (Unicode: 0x3)
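For what it's worth, 0x3 is a control character that XML 1.0 does not allow anywhere in a document, so a conforming parser is expected to stop there whether or not it handles UTF-8. Below is a rough Python sketch of one possible workaround: locate such characters and strip them before parsing. The function names and the file paths passed to `clean_file` are made up for illustration, and I am assuming the file is otherwise readable as UTF-8 (undecodable bytes are replaced with U+FFFD here).

```python
import re

# Matches any character outside the XML 1.0 "Char" production:
# #x9 | #xA | #xD | #x20-#xD7FF | #xE000-#xFFFD | #x10000-#x10FFFF
_INVALID_XML_CHARS = re.compile(
    r"[^\x09\x0A\x0D\x20-\uD7FF\uE000-\uFFFD\U00010000-\U0010FFFF]"
)


def find_invalid_xml_chars(text):
    """Yield (line, column, code point) for each character XML 1.0 forbids."""
    # Split on "\n" only: str.splitlines() would also break on some of the
    # control characters we are trying to locate.
    for line_no, line in enumerate(text.split("\n"), start=1):
        for match in _INVALID_XML_CHARS.finditer(line):
            yield line_no, match.start() + 1, ord(match.group())


def strip_invalid_xml_chars(text):
    """Return text with all XML 1.0-forbidden characters removed."""
    return _INVALID_XML_CHARS.sub("", text)


def clean_file(src_path, dst_path):
    """Stream a large dump line by line, writing a cleaned copy.

    Both paths are hypothetical; pass whatever names apply locally.
    """
    with open(src_path, encoding="utf-8", errors="replace", newline="") as src, \
         open(dst_path, "w", encoding="utf-8", newline="") as dst:
        for line in src:
            dst.write(strip_invalid_xml_chars(line))
```

Stripping silently changes the data, so it may be worth running `find_invalid_xml_chars` on a sample first to see how many entries are affected. This only addresses forbidden characters; it does nothing for missing closing quotes or duplicate entries.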