Who needs the RDF?

         

heini

5:35 pm on Jan 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I'm seeing several hits from new dmoz listings coming from sites utilizing the ODP.
Seems like more and more of them use scripts to refresh their version of the odp.
Wouldn't that be an option for the big guys too?

g1smd

8:22 pm on Jan 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



Sites scraping the dmoz server are using up some of its resources. It is probably better for high-volume external sites to have their own database, generated from the RDF, to serve to their users independently.
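As a rough illustration of "their own database, generated from the RDF", here is a minimal Python sketch that streams listings out of an RDF file into SQLite. The sample document, its element names (`ExternalPage`, `Title`, `Description`, `topic`), its namespaces, and the `listings` table are assumptions made for the example; check them against the actual dump before relying on any of it.

```python
import io
import sqlite3
import xml.etree.ElementTree as ET

# Hypothetical sample shaped like an ODP content dump. The element names and
# namespaces are assumptions; the real file may differ.
SAMPLE_RDF = b"""<?xml version="1.0" encoding="UTF-8"?>
<RDF xmlns:r="http://www.w3.org/TR/RDF/"
     xmlns:d="http://purl.org/dc/elements/1.0/"
     xmlns="http://dmoz.org/rdf/">
  <ExternalPage about="http://example.com/">
    <d:Title>Example Site</d:Title>
    <d:Description>A placeholder listing.</d:Description>
    <topic>Top/Computers/Internet</topic>
  </ExternalPage>
  <ExternalPage about="http://example.org/">
    <d:Title>Another Example</d:Title>
    <d:Description>A second placeholder listing.</d:Description>
    <topic>Top/Reference</topic>
  </ExternalPage>
</RDF>
"""


def local_name(tag):
    """Strip the '{namespace}' prefix ElementTree puts on tag names."""
    return tag.rsplit("}", 1)[-1]


def load_listings(stream, conn):
    """Stream listings from an RDF file object into a SQLite table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS listings "
        "(url TEXT PRIMARY KEY, title TEXT, description TEXT, topic TEXT)"
    )
    count = 0
    # iterparse reads the file incrementally instead of loading it whole.
    for _event, elem in ET.iterparse(stream, events=("end",)):
        if local_name(elem.tag) != "ExternalPage":
            continue
        fields = {local_name(child.tag): (child.text or "") for child in elem}
        conn.execute(
            "INSERT OR REPLACE INTO listings VALUES (?, ?, ?, ?)",
            (
                elem.get("about"),
                fields.get("Title", ""),
                fields.get("Description", ""),
                fields.get("topic", ""),
            ),
        )
        count += 1
        elem.clear()  # drop the element's children once they are stored
    conn.commit()
    return count


conn = sqlite3.connect(":memory:")
loaded = load_listings(io.BytesIO(SAMPLE_RDF), conn)
rows = conn.execute("SELECT url, topic FROM listings ORDER BY url").fetchall()
```

Once the data is in a local table, user requests never touch the dmoz server; only the periodic download of a new dump does.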

choster

10:07 pm on Jan 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



There is an official ODP mirror now available, first reported here back in November at [webmasterworld.com...] (thread on PR effect of the mirror is [webmasterworld.com...] ).

As of yet, I do not know of any other mirrors, and for that matter this one was not widely publicized either to editors or to the public. However, in the future, staff has indicated that they would like to expand the number of such mirrors. And perhaps some of those dmoz scraper scripts can be pointed to one of the new mirrors.

The mirror hosts only copies of the public pages, not the database itself; all submission, editing, and communications are still on dmoz.org alone. Anyone who's interested in hosting an ODP mirror will need to contact staff@dmoz.org to make the appropriate arrangements.

Hardware:
- 18GB hard drive
- 128MB memory
- 950 MHz processor

Software:
- sshd (should support ssh version 2)
- rsync
- apache

Login configuration:
- A login for dmoz staff. The login must have permission to write to the Apache server's DocumentRoot, so files can be moved.

Apache configuration:
- A ServerAlias for xx.dmoz.org, where xx is the relevant country code.
- Redirect all /editors and /cgi-bin requests to the main dmoz server

No advertising is permitted on the mirrors; a small attribution (as at the bottom of ch.dmoz.org pages) is permitted.

NFFC

10:41 pm on Jan 10, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



>option for the big guys too?

I think an option, a wish of Tara of researchbuzz fame if I remember correctly, would be a DMOZ "What's New" section. Without the RDF the only real way to find those new listings is to spider the whole thing. I'm sure a What's New could help reduce that need and lessen the load.

Of course with the constant improvements of the directory maybe a "What's Nuked" section would be needed also;)

Dynamoo

12:06 pm on Jan 11, 2003 (gmt 0)

10+ Year Member



Just to quote from dmoz.org's robots.txt file:

# Please do not crawl us faster than 1 hit/second
#
User-agent: *
Disallow: /cgi-bin/
Disallow: /editors/

In other words, please don't hammer our 460,000 pages by repeatedly crawling. At that rate it takes a minimum of 5.3 days to crawl the directory.. so RDF dumps and mirrors are quite useful :)
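For anyone who wants to check the numbers, here is a small Python sketch that parses the robots.txt lines quoted above with the standard library, reproduces the 5.3-day figure, and shows one way to keep a crawler at or below 1 hit/second. The `Throttle` class, the `crawl_days` helper, and the `examplebot` user agent are made up for illustration; nothing here comes from dmoz.

```python
import time
import urllib.robotparser

# The rules quoted above, parsed locally rather than fetched.
ROBOTS_TXT = """\
# Please do not crawl us faster than 1 hit/second
#
User-agent: *
Disallow: /cgi-bin/
Disallow: /editors/
"""

robots = urllib.robotparser.RobotFileParser()
robots.parse(ROBOTS_TXT.splitlines())


def crawl_days(pages, hits_per_second=1.0):
    """Minimum crawl time in days at a fixed request rate."""
    return pages / hits_per_second / 86400  # 86400 seconds per day


class Throttle:
    """Hypothetical helper: enforce a minimum interval between requests."""

    def __init__(self, min_interval=1.0, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None

    def wait(self):
        """Call before each request; sleeps if the last one was too recent."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now


days = crawl_days(460_000)  # about 5.3 days at 1 hit/second
editors_allowed = robots.can_fetch("examplebot", "http://dmoz.org/editors/")
arts_allowed = robots.can_fetch("examplebot", "http://dmoz.org/Arts/")
```

The rate limit itself is only a comment in robots.txt, so a parser will not enforce it; the crawler has to call something like `Throttle.wait()` before every request.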

daamsie

2:17 am on Jan 13, 2003 (gmt 0)



At that rate it takes a minimum of 5.3 days to crawl the directory.. so RDF dumps and mirrors are quite useful

hmm.. 5.3 days seems pretty good compared to 5.3 months ;)

victor

9:43 am on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



If you can write a better crawler, there's a niche market there for you.

After the 5.3 days data gathering don't forget to:

  • Dedupe (because some categories will have been renamed during that time, and you may have their contents twice)
  • Check for missing cats (ditto on renaming, so you may have missed them entirely)
  • Check each and every URL and flag as accessible, 404ing etc (This is an added-value service that your users will love)
  • Check each and every URL against your last run and flag them as changed or not (ditto)
  • Offer a compact download so one of your subscribers receives only the changes from the last feed they took from you
  • Guarantee a monthly publication date with appropriate financial penalties if you miss.

As far as I know, this would not be against ODP's terms of service (subscribers would be paying for the added value and guarantees, not the raw data), and it may help alleviate the feeling of powerlessness while we wait for the official dump.
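The dedupe and "only the changes" steps in the list above could be sketched roughly like this in Python. The record layout (dicts with `category`, `url`, and `status` keys) and the function names are assumptions chosen for the example, not anything the ODP defines.

```python
def dedupe(listings):
    """Keep one record per (category, url) pair, in first-seen order."""
    seen = set()
    unique = []
    for item in listings:
        key = (item["category"], item["url"])
        if key not in seen:
            seen.add(key)
            unique.append(item)
    return unique


def delta(previous, current):
    """Compare two crawl snapshots and return only what differs."""
    prev = {(i["category"], i["url"]): i for i in previous}
    curr = {(i["category"], i["url"]): i for i in current}
    return {
        "added": [curr[k] for k in curr if k not in prev],
        "removed": [prev[k] for k in prev if k not in curr],
        # Same listing in both runs, but some field (e.g. status) differs.
        "changed": [curr[k] for k in curr if k in prev and curr[k] != prev[k]],
    }


# Hypothetical data: last month's run versus this month's raw crawl.
previous = [
    {"category": "Top/Arts", "url": "http://example.com/", "status": 200},
    {"category": "Top/Arts", "url": "http://example.net/", "status": 200},
]
current = dedupe([
    {"category": "Top/Arts", "url": "http://example.com/", "status": 404},
    {"category": "Top/Arts", "url": "http://example.com/", "status": 404},
    {"category": "Top/News", "url": "http://example.org/", "status": 200},
])
feed = delta(previous, current)
```

A subscriber who already holds the previous snapshot would only need `feed`, which is usually far smaller than the full data set.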

creative craig

9:51 am on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I've been hanging out at the resource-zone forum lately and it seems that not a lot of the META editors know what is going on with the RDF dump either.

"What's Nuked section"

lol :)

Craig

lazerzubb

9:54 am on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



I would also like a Yahoo-style "What's New" section at dmoz.

Dynamoo

1:42 pm on Jan 13, 2003 (gmt 0)

10+ Year Member



Along with a "what's *old*" section for those few parts of the directory still living in 1999 ;)

rafalk

2:05 pm on Jan 13, 2003 (gmt 0)

10+ Year Member



I've been hanging out at the resource-zone forum lately and it seems that not a lot of the META editors know what is going on with the RDF dump either.

None would be a more accurate assessment. That having been said, it's been nearly 4 months now since the dump was last produced, which even I think is pretty outrageous.

Dumpy

4:05 pm on Jan 13, 2003 (gmt 0)

10+ Year Member



I wish to commend you for your willingness to make a disparaging remark against Netscape and its obvious willingness to NOT provide resources to DMOZ.

It is particularly brave of you, considering your knowledge that you can be barred from being a participant with a stroke of a key...and never know the reason or have a chance to be judged by your peers.

I AM IMPRESSED!

victor

4:21 pm on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



None would be a more accurate assessment.

The equivalent thread to this on DMOZ has 144 messages to date. None is not an accurate assessment.

mosley700

4:21 pm on Jan 13, 2003 (gmt 0)

10+ Year Member



>>I AM IMPRESSED!<<
I'm always impressed with this particular editor.

kctipton

4:52 pm on Jan 13, 2003 (gmt 0)

10+ Year Member



The meta-editors are about editing, not producing the RDF. As has been stated often, staff is responsible for RDF generation and fixing any problems which keep it from happening. No, no meta-editors can predict when a new RDF will be generated.

creative craig

5:38 pm on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



In the description of a META editor straight from the ODP:

The meta permission is highest level of permission granted. Meta editors are the leaders and community managers for the entire Directory. As a group, the Metas form the governing body of the ODP. They work with staff and the community to set directory guidelines and shape its growth.

The part about helping to shape the growth of the directory, being the community managers, the highest level in the ODP, the governing body, and they don't know anything about the dump? They need to talk to the staff then.

Craig

g1smd

8:28 pm on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



All editors, from the very lowest level to the very highest level, do not have any say in how the hardware and software that comprise the ODP are run. Period. How many times does this have to be re-stated before you will believe it? Yeah, they do need to talk to the staff, as it is the staff that run the hardware and the software. Staff is two people. The editor community (at all levels) is merely a user of that hardware and software, not the owner or controller of it.

If you and a few friends started a community as a YahooGroup you would be able to decide who joined your group, delete people you didn't want in your group, delete postings that you didn't want in the archive and run that group how you wanted to run it (within Yahoo Terms and Conditions), but you would have no say in how Yahoo ran the hardware, whether your group messages were backed up or not, how often old messages were purged, how much filespace your group was allowed, and how much uptime the server would have. Although not the best analogy, it may serve as a pointer to how the ODP community uses the ODP facilities.

HuhuFruFru

9:21 pm on Jan 13, 2003 (gmt 0)

10+ Year Member



Hahaha, the biggest directory in the world, and "staff = two people"? THAT'S ridiculous.

victor

9:41 pm on Jan 13, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



It's also called productivity.

mosley700

9:46 pm on Jan 13, 2003 (gmt 0)

10+ Year Member



>>It's also called productivity. <<

Maybe you missed the contents of this thread. The RDF dump hasn't happened for four months.

rafalk

10:24 pm on Jan 13, 2003 (gmt 0)

10+ Year Member



At this point I think there's absolutely no use in rehashing the "Why doesn't Netscape support the ODP better" issue. It's been discussed to death, and frankly it's about as useful as discussing why the sky is blue.

A much more useful topic (and one more in line with the initial purpose of this thread) would be: "Is it possible to create an RDF "dump" by crawling the ODP (or one of its mirrors), and what would the pros and cons of this be?"

Laisha

1:46 am on Jan 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



...and frankly it's about as useful as discussing why the sky is blue.

Well, at least there is a concrete answer to why the sky is blue. ;)

victor

9:58 pm on Jan 14, 2003 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member



Is it possible to create an RDF "dump" by crawling the ODP (or one of its mirrors), and what would the pros and cons of this be?

Anything is possible. The hard part is going to be spotting the cats that have been moved or renamed while you've been crawling the ODP. If you don't spot them you'll miss whole cats or take some twice.

Crawling a mirror might be safer -- you'll only have the moved-cat problem if they refresh the mirror while you are crawling, but the data would be older.

ODP don't have this problem as they disallow cat moves during RDF generation. That creates a separate problem as there are now loads of cat changes lined up waiting for a pause in the RDFing.
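One possible heuristic for spotting cats that were taken twice because of a move or rename mid-crawl is to group categories by the exact set of URLs they list. This is only a sketch under that assumption: the `likely_renames` function and the sample paths are hypothetical, and it will not catch a cat whose contents were also edited between the two visits.

```python
from collections import defaultdict


def likely_renames(categories, min_size=2):
    """Group category paths whose URL sets are identical.

    categories: dict mapping a category path to an iterable of listed URLs.
    min_size: ignore tiny categories, where identical contents are more
    likely to be coincidence than a rename.
    """
    by_contents = defaultdict(list)
    for path, urls in categories.items():
        fingerprint = frozenset(urls)
        if len(fingerprint) >= min_size:
            by_contents[fingerprint].append(path)
    return [sorted(paths) for paths in by_contents.values() if len(paths) > 1]


# Hypothetical crawl result: the same cat seen under its old and new name.
crawled = {
    "Top/Arts/Music/Old_Name": ["http://example.com/a", "http://example.com/b"],
    "Top/Arts/Music/New_Name": ["http://example.com/b", "http://example.com/a"],
    "Top/Reference": ["http://example.org/x", "http://example.org/y"],
}
suspects = likely_renames(crawled)
```

Each group in `suspects` is a candidate for a manual or follow-up check (for instance, re-requesting both paths to see which one still exists) rather than an automatic merge.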

beebware

6:34 pm on Jan 15, 2003 (gmt 0)

10+ Year Member



>> Crawling a mirror might be safer -- you'll only have the moved-cat problme if they refresh the mirrow while you are crawling but the data would be older. <<

The "official mirror" site is, to my knowledge, updated and synced in real time with the main ODP site. I think at one point it was 3 days behind, but as soon as the problem was spotted it caught back up.