DMOZ - running your own mirror

rdf's, site breakdowns, is it safe?

         

claus

3:16 pm on Sep 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm thinking about running yet another ODP clone, albeit only a partial one (selected cats). As the dmoz.org site is currently less functional, what about those RDF feeds?

- are they getting updated?
- can they get accessed? (they're rather big, and if the main site does not work well, then how about these feeds)
- is it stable?
- other preferred methods (eg. "scraping" a well-working other mirror - an official one of course)?

Does anyone have experience, advice, or comments?

/claus

flicker

3:30 pm on Sep 28, 2003 (gmt 0)

10+ Year Member



The RDF feeds are on a different server (rdf.dmoz.org) than the public one that's been having so much trouble lately (dmoz.org). Like the editing server, the RDF server has been working great; it's only the one public server that's been suffering so much. That's my understanding anyway.

orlady

3:52 pm on Sep 28, 2003 (gmt 0)

10+ Year Member



Adding to what flicker said:
As far as I know, the RDF is being updated on a weekly basis right now.

windharp

4:08 pm on Sep 28, 2003 (gmt 0)

10+ Year Member



[Edit: When I wrote this, there was a posting complaining that the RDF server is not accessible via ping. Without that posting, mine looked a bit ridiculous, so I'm adding this comment. Looks like the original was removed or was lost when I posted]

That's an old one: For security reasons, Netscape had their servers non-pingable before the server upgrade. I didn't check them all but assume they are still behind the firewall and for that reason still do not respond to a ping.

claus

4:35 pm on Sep 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Thanks for that info, windharp. It's good to know that you cannot judge the response time or uptime using ping; this info is not found on the dmoz site (at least i have not found it).

However, i could still use a little more feedback - any problems or issues to be aware of?

Any SE issues? Duplicate penalties, or whatever?

Is everything just wonderful, running such a mirror?

/claus
I guess my original post was too specific so it has been removed, sorry mods.

kwasher

4:36 pm on Sep 28, 2003 (gmt 0)

10+ Year Member



the most important bits of the directory are working perfectly - that's the editor side where thousands of sites are being listed every day

I was wondering... could they move to a 'distributed' kind of network? You know, I've seen this somewhere else, like NASA or something, where you agree to donate some of your computer's unused processor time to help them with the processing. Anyone know what I mean? (This is just a thought, I have NO idea if it's possible.)

--Kenn

NOTE: Sorry if that was off topic. As mentioned I have no idea if it's possible... and with that in mind, I thought it was similar to mirroring but in a different way.

[edited by: kwasher at 4:57 pm (utc) on Sep. 28, 2003]

claus

4:45 pm on Sep 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> 'distributed' kind of network

As they encourage mirroring (and it is a good idea) they reduce some of the load on their own frontend servers and make the directory available to others at the same time. It's not the same as you suggest, but please, please, keep this thread on the topic of mirroring the ODP.

/claus

windharp

4:46 pm on Sep 28, 2003 (gmt 0)

10+ Year Member



[This really becomes a bit offtopic, but since it is at least about mirrors, I think this will be okay ;) ]

1) The central server farm was just made a bit more distributed - only that the servers are placed in the same building :-) But there are several public servers right now (as was posted in lots of fora). As you surely can imagine (okay, at least I can, for some reasons :) ) changing from a single server to several is quite difficult: lack of internal bandwidth, thinking of the best way to upgrade your mirrors, ...

2) Apart from some data users like Google, there are some official mirrors in place: ch.dmoz.org and de.dmoz.org (a third one was in preparation; I think it was in India but I don't remember exactly. The server upgrade stopped this, I suppose). Only disadvantage: while the servers were upgraded (see point 1) the synchronisation process was stopped for some reason. I was away for the weekend, but as of Friday it had not been restarted. In my eyes it is quite reasonable to say that the internal problems should be dealt with first, while the external servers can wait. They are still working, only "a bit" out of date.

Yidaki

5:17 pm on Sep 28, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>Any SE issues? Duplicate penalties, or whatever?
>Is everything just wonderful, running such a mirror?

curious too ...

(PARTIAL, customized clones)

claus

3:26 pm on Sep 29, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



One day later and these are my observations. As i wrote, i want only a partial mirror (selected cats, not all). To do this, these are the steps i now reckon must be taken (not necessarily in this order):

  • Download/GET at least a gzipped 242Mb RDF-file from dmoz (and perhaps a few more) weekly or less frequently
  • Unzip - it becomes quite large (they all do, as gz will compress these file-types to approximately 10% or so according to testing with one at 47Mb) - this one will be around 3Gb i guess.
  • create or install a program/script that will parse such an XML-formatted file
  • take precautions, as these files are not only large, they also have errors
  • make a "selector" to that program that identifies and extracts the relevant subset(s) of those files
  • make static pages out of them, including paths/URLs or put them in a database and make dynamic pages out of them
  • include appropriate attributions on these pages as well as statement of modifications - all according to the TOS
  • discard the un-needed parts of the huge files to free up space.
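
The parse-and-select steps in the list above could be sketched roughly as follows. This is only an illustration, not a tested ODP tool: it assumes the dump stores each listing as an `<ExternalPage about="...">` block with `<d:Title>`, `<d:Description>` and `<topic>` children, one element per line, and the function name `select_pages` and the sample data are made up for the example. It reads line by line instead of using a strict XML parser, so a damaged record is dropped rather than aborting the whole run.

```python
import re
from typing import Dict, Iterable, Iterator, List

# Assumed record layout (an assumption, check it against the real dump):
#   <ExternalPage about="URL"> ... <topic>Top/...</topic> ... </ExternalPage>
PAGE_START = re.compile(r'<ExternalPage about="([^"]*)">')
FIELD = re.compile(r'<(d:Title|d:Description|topic)>(.*?)</\1>')
NAMES = {"d:Title": "title", "d:Description": "description", "topic": "topic"}


def in_wanted(topic: str, wanted: List[str]) -> bool:
    """True if topic is one of the wanted categories or lies below one."""
    return any(topic == cat or topic.startswith(cat + "/") for cat in wanted)


def select_pages(lines: Iterable[str], wanted: List[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per listing whose topic falls in a wanted category."""
    record = None
    for line in lines:
        start = PAGE_START.search(line)
        if start:
            # A new record begins; an unfinished previous one is discarded.
            record = {"url": start.group(1)}
            continue
        if record is None:
            continue  # outside any listing (Topic blocks, headers, ...)
        if "</ExternalPage>" in line:
            if in_wanted(record.get("topic", ""), wanted):
                yield record
            record = None
            continue
        field = FIELD.search(line)
        if field:
            record[NAMES[field.group(1)]] = field.group(2)


# Made-up sample data, including one record that is never closed.
SAMPLE = '''<ExternalPage about="http://example.com/a">
  <d:Title>Site A</d:Title>
  <topic>Top/Arts/Music</topic>
</ExternalPage>
<ExternalPage about="http://example.com/b">
  <d:Title>Site B</d:Title>
  <topic>Top/Arts/Musicals</topic>
</ExternalPage>
<ExternalPage about="http://example.com/broken">
  <d:Title>Never closed
<ExternalPage about="http://example.com/c">
  <d:Title>Site C</d:Title>
  <topic>Top/Arts/Music/Jazz</topic>
</ExternalPage>'''.splitlines()

selected = list(select_pages(SAMPLE, ["Top/Arts/Music"]))
```

Because only the current record is held in memory, the size of the uncompressed file should not matter much. Note that the sketch does not decode XML entities such as `&amp;`, and a field spread over several lines would be missed.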

As i'm interested in mirroring no more than a small number of cats (between 10 and 50 out of 460,000), this process has so much overhead that it will be very much easier to just mirror the whole dmoz instead. Then again, this will take an awful lot of server space (i have not yet been able to estimate exactly how much, as those files are huge. At least a few gigabytes i reckon)

A full dmoz mirror is not an option at present. Partly because of lack of sufficient disk space, and partly (more important) because it would not be relevant for me to mirror all cats. It would be quite easy to just proxy the whole dmoz.org site, but that is not the same as mirroring, it is not what i want to do, and there's no real user benefit in this, neither for me nor for dmoz.

So, at present i'm stuck. There are some links in the subcats of "Directories/Open_Directory_Project/" that i will investigate and then i'll either forget the idea completely or move on with it - others must have been there before i think.

Still, nobody wants to (or can) share experiences?

That's a bit worrying. Surely, i can't be the only one in here having thought about this?

/claus

yapuka

4:44 pm on Sep 29, 2003 (gmt 0)

10+ Year Member



If you're interested in such a small number of cats, why don't you simply get the pages themselves, and then parse them into a DB or something else?

That should be much easier than dealing with the entire dump.
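
A rough sketch of that "get the pages and parse them into a DB" idea, using only the Python standard library. Everything specific here is an assumption: the sample HTML is invented and does not reproduce the real dmoz markup, the `listing` table and the names `LinkCollector` and `store_links` are hypothetical, and fetching the page itself is left out.

```python
import sqlite3
from html.parser import HTMLParser


class LinkCollector(HTMLParser):
    """Collect (url, link text) pairs for absolute http(s) links only."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._href = None
        self._text = []

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            href = dict(attrs).get("href")
            # Relative links are assumed to be directory navigation; skip them.
            if href and href.startswith(("http://", "https://")):
                self._href = href
                self._text = []

    def handle_data(self, data):
        if self._href is not None:
            self._text.append(data)

    def handle_endtag(self, tag):
        if tag == "a" and self._href is not None:
            self.links.append((self._href, "".join(self._text).strip()))
            self._href = None


def store_links(conn, category, html):
    """Parse one category page and upsert its external links."""
    parser = LinkCollector()
    parser.feed(html)
    parser.close()
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listing ("
        "category TEXT, url TEXT, title TEXT, PRIMARY KEY (category, url))"
    )
    conn.executemany(
        "INSERT OR REPLACE INTO listing VALUES (?, ?, ?)",
        [(category, url, title) for url, title in parser.links],
    )
    conn.commit()
    return len(parser.links)


# Invented sample page: one external listing and one internal link.
SAMPLE_HTML = (
    '<ul><li><a href="http://example.com/">Example Site</a> - A description.</li>'
    '<li><a href="/Arts/Music/">Music</a></li></ul>'
)
conn = sqlite3.connect(":memory:")
stored = store_links(conn, "Top/Arts", SAMPLE_HTML)
rows = conn.execute("SELECT category, url, title FROM listing").fetchall()
```

The primary key on (category, url) means a later re-fetch of the same page overwrites existing rows instead of duplicating them. Listings that have disappeared upstream would still need to be deleted separately.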

theseeker

5:05 pm on Sep 29, 2003 (gmt 0)

10+ Year Member



>> Unzip ...snip... this one will be around 3Gb i guess.

>> create or install a program/script that will parse such a XML-formatted file

I use a perl program to parse the files. Using the Compress::Zlib module, I don't have to unzip the file before processing it.
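
The same read-it-compressed idea, sketched in Python for illustration (this is not the Perl program mentioned above). The standard `gzip` module can open the file in text mode and decompress it as it is read, so the multi-gigabyte unzipped copy never has to exist on disk. The function name `iter_rdf_lines` and the UTF-8 assumption are mine; `errors="replace"` is one possible way to keep going past bad bytes.

```python
import gzip
from typing import Iterator


def iter_rdf_lines(path: str) -> Iterator[str]:
    """Yield the lines of a gzipped dump one at a time, decompressing lazily."""
    # "rt" = read as text; undecodable bytes become U+FFFD instead of
    # raising, on the assumption that a few bad characters are tolerable.
    with gzip.open(path, "rt", encoding="utf-8", errors="replace") as handle:
        for line in handle:
            yield line.rstrip("\n")
```

Typical use would be something like `for line in iter_rdf_lines("content.rdf.u8.gz"): ...`, where the file name is only an example and should be whatever the downloaded dump is called.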

claus

6:40 pm on Sep 29, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>> why don't you simply get the pages themselves

Good question. First, because the dmoz TOS does not mention this option of course. Second because the RDF files apparently are being updated weekly, and are on another server than the dmoz frontend (at the moment more reliable it seems).

/claus

yapuka

8:12 pm on Sep 29, 2003 (gmt 0)

10+ Year Member



The license says:
Basic License. Netscape grants you a non-exclusive, royalty-free license to use, reproduce, modify and create derivative works from, and distribute and publish the Open Directory (...)

It doesn't say you have to use the RDF dump.

And yes the public servers are erratic, to say the least, at this time. But since you'd only want to re-fetch the pages every so often, you could use the mirrors, which are just about as fresh as the public pages and a little bit fresher than the RDF (or at least should be).

Glacai

9:35 pm on Sep 29, 2003 (gmt 0)

10+ Year Member



The file is 1.2gb uncompressed and if you're only extracting the urls and topic from a particular cat I don't think you'll come across that many errors, I haven't yet. It does also seem to be very up to date, sites I got listed just over a week ago are in there.

claus

3:15 am on Sep 30, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Great posts, thanks everyone :)

As to just grabbing the pages, i mentioned the possibility in my first post, and it would be the easiest by far. Normally this is not a thing that webmasters welcome, so i thought i would look carefully for explicit mention of this in the TOS. I agree that it seems to be covered under the general terms "use, reproduce, modify and create derivative works from", so it seems as if you just make sure you have those attributions, there will be no problems.

I think i have to modify them as well, though. The suggested (required verbatim?) HTML markup is not very nice as i prefer CSS (does it even validate?) and that green color suits only one site and that's the one it's on, but that's details. At least i hope so.

Now, SE effects:

Has anyone got experience with this (kind of) mirroring being considered duplicate content? Penalties? After all, it is a (partial) copy of another site, even if allowed and encouraged. Opinions?

/claus

zoltan

8:09 am on Oct 19, 2003 (gmt 0)

10+ Year Member



"The file is 1.2gb uncompressed"

1.2gb of data or does this also contain the generated HTML pages?

cyberprosper

1:15 pm on Oct 19, 2003 (gmt 0)

10+ Year Member



SEO ---> It is definitely a copy in the eyes of the search engines. Your pages will not have a chance of ranking high, unless you get higher PR than other, similar pages.