DMOZ size?

         

zoltan

7:53 am on Oct 19, 2003 (gmt 0)

10+ Year Member



One of my friends suggested using our dedicated server to download the DMOZ directory with wget. Has anyone tried this method? If so, what would the size of the entire directory be (all the HTML pages, I'm not referring to the rdf)?

Jeff_H

9:29 am on Oct 19, 2003 (gmt 0)

10+ Year Member



Oh, sorry, I gave you the wrong info. I'll leave it here, since it might be useful to others.
---------------------------------------------------
You can find the information site here:
[rdf.dmoz.org...]

structure.rdf.u8.gz -- 448 MB
content.rdf.u8.gz -- 1.22 GB

On a side note, the DMOZ server pushed the data at a speedy 10mb/sec. Nice!

zoltan

9:43 am on Oct 19, 2003 (gmt 0)

10+ Year Member



Thanks for the info... however I need something else.
I do not need the compressed size. I need to know the size of the entire directory if I download all the HTML pages individually with wget.

Yidaki

9:55 am on Oct 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>all the HTML pages, I'm not referring to the rdf

[dmoz.org...]

# If you need to examine many dmoz pages, please download the rdf file from
# [rdf.dmoz.org...] instead of crawling us.

victor

9:57 am on Oct 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



DMOZ says they have 460,000 categories.

Each will be a web page. If you assume 25 KB per page, that's around 11.5 GB.

DMOZ asks that you don't crawl faster than 1 page per second (http://www.dmoz.org/robots.txt).

If you do, you'll probably get banned.

If you stay within that limit, it'll take a minimum of 127 hours to access 460,000 pages, plus the time to actually read the pages.
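
To make that estimate easy to redo with different assumptions, here is a rough Python sketch of the arithmetic. The category count, the 25 KB average page size and the one-page-per-second limit are the figures quoted in this post; the function and variable names are only illustrative.

```python
# Back-of-the-envelope estimate. The inputs are assumptions, not measurements.
CATEGORIES = 460_000      # category count quoted in the post
AVG_PAGE_KB = 25          # assumed average size of one category page
SECONDS_PER_PAGE = 1      # crawl rate limit: at most one page per second


def estimate(categories=CATEGORIES, avg_page_kb=AVG_PAGE_KB,
             seconds_per_page=SECONDS_PER_PAGE):
    """Return (total size in GB, minimum crawl time in hours)."""
    total_gb = categories * avg_page_kb / 1_000_000  # KB -> GB, decimal units
    hours = categories * seconds_per_page / 3600     # seconds -> hours
    return total_gb, hours


total_gb, hours = estimate()
print(f"approx. size: {total_gb:.1f} GB")
print(f"approx. crawl time: {hours:.1f} hours ({hours / 24:.1f} days)")
```

Changing the assumed page size or the delay shows how sensitive both numbers are to those guesses.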

There is a significant chance that categories will have been added/moved/renamed/deleted during those 127+ hours. So your code should check for apparent inconsistencies caused by that -- otherwise, you may end up in a loop.
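
One way to guard against that kind of loop is to remember every category path already queued and to skip pages that have disappeared. The following is a minimal sketch, not tested against DMOZ itself: `fetch_links` is a hypothetical callable you would have to supply (it should return the category paths linked from a page, or `None` if the page is gone), and the one-second default delay mirrors the limit mentioned above.

```python
import time
from collections import deque


def crawl(start, fetch_links, delay=1.0, max_pages=None, sleep=time.sleep):
    """Breadth-first walk that visits each category path at most once.

    fetch_links(path) is a hypothetical, caller-supplied function returning
    the list of category paths linked from `path`, or None if the page has
    been moved or deleted since it was discovered.
    """
    seen = {start}            # every path ever queued; this prevents loops
    queue = deque([start])
    visited = []              # paths actually fetched, in visit order
    while queue:
        if max_pages is not None and len(visited) >= max_pages:
            break
        path = queue.popleft()
        links = fetch_links(path)
        sleep(delay)          # pause after each request to respect the rate limit
        if links is None:     # category vanished mid-crawl: skip it
            continue
        visited.append(path)
        for link in links:
            if link not in seen:
                seen.add(link)
                queue.append(link)
    return visited
```

The `seen` set grows with the number of categories, which for roughly 460,000 short path strings should fit comfortably in memory.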

All-in-all, you need a good case to do it this way rather than using the RDF.

zoltan

10:55 am on Oct 19, 2003 (gmt 0)

10+ Year Member



Thanks for the advice.
Does anyone know of a free script that converts RDF data to static HTML pages?

heini

11:05 am on Oct 19, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Did you check out this category [directory.google.com]?

zoltan

11:23 am on Oct 19, 2003 (gmt 0)

10+ Year Member



Yes, I did...
However, I would rather try something that has been used by one of the members of this board.
Any specific feedback would be appreciated.