As of yet, I do not know of any other mirrors, and for that matter this one was not widely publicized either to editors or to the public. However, in the future, staff has indicated that they would like to expand the number of such mirrors. And perhaps some of those dmoz scraper scripts can be pointed to one of the new mirrors.
The mirror hosts only copies of the public pages, not the database itself; all submission, editing, and communications are still on dmoz.org alone. Anyone who's interested in hosting an ODP mirror will need to contact staff@dmoz.org to make the appropriate arrangements.
Hardware:
- 18GB hard drive
- 128MB memory
- 950 MHz processor
Software:
- sshd (should support ssh version 2)
- rsync
- apache
Login configuration:
- A login for dmoz staff. The login must have permission to write to the Apache server's DocumentRoot, so files can be moved.
Apache configuration:
- A ServerAlias for xx.dmoz.org, where xx is the relevant country code.
- Redirect all /editors and /cgi-bin requests to the main dmoz server
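For anyone curious what the Apache items above might look like in practice, here is a rough sketch in Python that prints a config fragment for a given country code. The function name is made up for illustration, the use of http://dmoz.org as the "main dmoz server" is an assumption, and staff may well want something different; only the ServerAlias and Redirect directives themselves are standard Apache.

```python
def mirror_vhost_fragment(country_code, main_server="http://dmoz.org"):
    """Render a hypothetical Apache fragment for an xx.dmoz.org mirror.

    main_server is an assumption; the requirements only say
    "the main dmoz server".
    """
    lines = [
        # Answer requests for xx.dmoz.org on this virtual host.
        f"ServerAlias {country_code}.dmoz.org",
        # Send editor and CGI traffic back to the main server.
        f"Redirect /editors {main_server}/editors",
        f"Redirect /cgi-bin {main_server}/cgi-bin",
    ]
    return "\n".join(lines)


fragment = mirror_vhost_fragment("ch")
print(fragment)
```

The output would still have to be placed inside whatever VirtualHost block the mirror host already uses.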
No advertising is permitted on the mirrors; a small attribution (as at the bottom of ch.dmoz.org pages) is permitted.
I think an option, a wish of Tara of researchbuzz fame if I remember correctly, would be a DMOZ "What's New" section. Without the RDF, the only real way to find those new listings is to spider the whole thing; I'm sure a What's New could help reduce that need and lessen the load.
Of course with the constant improvements of the directory maybe a "What's Nuked" section would be needed also;)
# Please do not crawl us faster than 1 hit/second
#
User-agent: *
Disallow: /cgi-bin/
Disallow: /editors/
In other words, please don't hammer our 460,000 pages by repeatedly crawling. At that rate it takes a minimum of 5.3 days to crawl the directory.. so RDF dumps and mirrors are quite useful :)
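To check the 5.3-day figure and see how a crawler would read those rules, here is a minimal sketch using Python's standard urllib.robotparser. The page count and the 1 hit/second limit come from the post and the robots.txt comment; the "examplebot" user agent and the sample URLs are assumptions for illustration.

```python
from urllib.robotparser import RobotFileParser

ROBOTS_TXT = """\
# Please do not crawl us faster than 1 hit/second
#
User-agent: *
Disallow: /cgi-bin/
Disallow: /editors/
"""

PAGES = 460_000        # page count quoted in the post
HITS_PER_SECOND = 1    # limit requested in the robots.txt comment
SECONDS_PER_DAY = 86_400


def min_crawl_days(pages, hits_per_second):
    """Shortest possible crawl time, in days, at a fixed request rate."""
    return pages / hits_per_second / SECONDS_PER_DAY


parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())

days = min_crawl_days(PAGES, HITS_PER_SECOND)
print(f"Minimum crawl time: {days:.1f} days")

# "examplebot" and these URLs are hypothetical.
public_ok = parser.can_fetch("examplebot", "http://dmoz.org/Computers/")
editors_ok = parser.can_fetch("examplebot", "http://dmoz.org/editors/")
print(public_ok, editors_ok)
```

Note that the rate limit lives only in a comment, so the parser cannot enforce it; a crawler would have to sleep between requests on its own.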
At that rate it takes a minimum of 5.3 days to crawl the directory.. so RDF dumps and mirrors are quite useful
hmm.. 5.3 days seems pretty good compared to 5.3 months ;)
After the 5.3 days data gathering don't forget to:
As far as I know, this would not be against ODP's terms of service (subscribers would be paying for the added value and guarantees, not the raw data), and it may help alleviate the feeling of powerlessness while we wait for the official dump.
I've been hanging out at the resource-zone forum lately and it seems that not a lot of the META editors know what is going on with the RDF dump either.
None would be a more accurate assessment. That having been said, it's been nearly 4 months now since the dump's been produced which even I think is pretty outrageous.
It is particularly brave of you, considering your knowledge that you can be barred from being a participant with a stroke of a key...and never know the reason or have a chance to be judged by your peers.
I AM IMPRESSED!
The meta permission is the highest level of permission granted. Meta editors are the leaders and community managers for the entire Directory. As a group, the Metas form the governing body of the ODP. They work with staff and the community to set directory guidelines and shape its growth.
The part about helping to shape the growth of the directory, being the community managers, the highest level in the ODP, the governing body, and they don't know anything about the dump? They need to talk to the staff, then.
Craig
If you and a few friends started a community as a YahooGroup you would be able to decide who joined your group, delete people you didn't want in your group, delete postings that you didn't want in the archive and run that group how you wanted to run it (within Yahoo Terms and Conditions), but you would have no say in how Yahoo ran the hardware, whether your group messages were backed up or not, how often old messages were purged, how much filespace your group was allowed, and how much uptime the server would have. Although not the best analogy, it may serve as a pointer to how perhaps the ODP community uses the ODP facilities.
A much more useful topic (and one more in line with the initial purpose of this thread) would be: "Is it possible to create an RDF "dump" by crawling the ODP (or one of its mirrors), and what would the pros and cons of this be?"
Is it possible to create an RDF "dump" by crawling the ODP (or one of its mirrors), and what would the pros and cons of this be?
Anything is possible. The hard part is going to be spotting the cats that have been moved or renamed while you've been crawling the ODP. If you don't spot them you'll miss whole cats or take some twice.
Crawling a mirror might be safer -- you'll only have the moved-cat problem if they refresh the mirror while you are crawling, but the data would be older.
ODP don't have this problem as they disallow cat moves during RDF generation. That creates a separate problem as there are now loads of cat changes lined up waiting for a pause in the RDFing.
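One rough way to spot the moved-cat problem described above would be to list the category paths once before the crawl and once after, then compare the two lists. This is only a sketch under that assumption; the function name and the category paths are made up for illustration, and it says nothing about how the ODP itself handles moves.

```python
def diff_category_snapshots(before, after):
    """Compare category paths seen at the start and end of a crawl.

    Paths that vanished may have been moved or renamed mid-crawl;
    paths that appeared may be their new locations.
    """
    before_set, after_set = set(before), set(after)
    return {
        "vanished": sorted(before_set - after_set),
        "appeared": sorted(after_set - before_set),
    }


# Hypothetical category paths, for illustration only.
start = ["/Arts/", "/Arts/Music/", "/Computers/"]
end = ["/Arts/", "/Arts/Performing/Music/", "/Computers/"]

changes = diff_category_snapshots(start, end)
print(changes)
```

Anything in either list would be a candidate for a second look, since a cat could have been missed entirely or picked up twice under its old and new names.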
The "official mirror" site is, to my knowledge, updated and synced in real time with the main ODP site. I think at one point it was 3 days behind, but as soon as the problem was spotted it caught back up.