structure.rdf.u8.gz -- 448 MB
content.rdf.u8.gz -- 1.22 GB
On a side note, the DMOZ server pushed the data at a speedy 10mb/sec. Nice!
[dmoz.org...]
# If you need to examine many dmoz pages, please download the rdf file from
# [rdf.dmoz.org...] instead of crawling us.
Each will be a web page. If you assume 25 KB per page, that's around 11 GB.
DMOZ asks that you don't crawl faster than 1 page per second (http://www.dmoz.org/robots.txt).
If you do, you'll probably get banned.
If you don't, it'll take a minimum of 127 hours to access 460,000 pages (460,000 seconds at one page per second), plus the time to actually read the pages.
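For anyone who wants to redo the arithmetic with their own numbers, here is a rough sketch. The 460,000 pages, 1 page per second and 25 KB per page are the figures from this thread; the function names are just made up for the example.

```python
# Back-of-the-envelope crawl estimate using the figures quoted in this thread.
PAGES = 460_000          # number of pages to fetch
SECONDS_PER_PAGE = 1     # robots.txt limit: no faster than 1 page per second
KB_PER_PAGE = 25         # assumed average page size, not a measured value


def crawl_hours(pages, seconds_per_page=SECONDS_PER_PAGE):
    """Minimum wall-clock hours if every request waits the full delay."""
    return pages * seconds_per_page / 3600


def crawl_gigabytes(pages, kb_per_page=KB_PER_PAGE):
    """Total download size in decimal gigabytes (1 GB = 1,000,000 KB)."""
    return pages * kb_per_page / 1_000_000


print(f"{crawl_hours(PAGES):.1f} hours")    # 127.8 hours
print(f"{crawl_gigabytes(PAGES):.1f} GB")   # 11.5 GB
```

Compare that 11.5 GB with the two compressed RDF files, which together come to well under 2 GB.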
There is a significant chance that categories will have been added/moved/renamed/deleted during those 127+ hours. So your code should check for apparent inconsistencies caused by that -- otherwise, you may end up in a loop.
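One simple way to guard against that kind of loop is to remember every category path you have already queued and never queue it twice. This is only a sketch: `crawl` and `get_links` are hypothetical names, and the toy dictionary stands in for real page fetches so the example runs on its own.

```python
from collections import deque


def crawl(start, get_links, max_pages=None):
    """Breadth-first walk that visits each category path at most once.

    get_links is a caller-supplied function (hypothetical here) that
    returns the child category paths found on a given page.
    """
    seen = {start}            # every path ever queued
    queue = deque([start])
    order = []                # paths in the order they were visited
    while queue:
        path = queue.popleft()
        order.append(path)
        if max_pages is not None and len(order) >= max_pages:
            break             # hard stop as a second safety net
        for child in get_links(path):
            if child not in seen:   # skip anything already queued
                seen.add(child)
                queue.append(child)
    return order


# Toy category graph with a link back to a parent, as might appear
# after a category is moved or renamed mid-crawl.
toy = {
    "Top": ["Top/Arts", "Top/Science"],
    "Top/Arts": ["Top/Arts/Music"],
    "Top/Arts/Music": ["Top/Arts"],
    "Top/Science": [],
}

print(crawl("Top", lambda p: toy.get(p, [])))
# ['Top', 'Top/Arts', 'Top/Science', 'Top/Arts/Music']
```

In a real crawl, `get_links` would fetch and parse the page, and you would sleep at least one second between requests to stay within the robots.txt limit.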
All in all, you need a good reason to do it this way rather than using the RDF.
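If you go the RDF route, you don't have to unpack the files first. Below is a minimal sketch that streams a gzipped dump line by line and counts category entries. It assumes each category starts on a line beginning with `<Topic `, which you should check against your own download; `count_topics` is a made-up helper name.

```python
import gzip


def count_topics(path, marker="<Topic "):
    """Count lines in a gzipped dump that start with the given marker.

    The default marker is an assumption about the dump layout; adjust it
    after looking at the first few hundred lines of the real file.
    """
    count = 0
    # "rt" decompresses on the fly and yields text lines, so the whole
    # file is never held in memory. Undecodable bytes are replaced.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as f:
        for line in f:
            if line.lstrip().startswith(marker):
                count += 1
    return count
```

You would call it as `count_topics("structure.rdf.u8.gz")` from the directory holding the download.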