


DMOZ Replacement Suggestions?

     
5:33 am on May 7, 2017 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2004
posts:142
votes: 0


I'm looking for a replacement for the DMOZ directory, anyone have any suggestions?

But before you reply, understand how I was using it:
One of the tools on my site is a "Random Website" tool that lets you essentially browse a 'random' website. It's a neat tool, useful for sparking brainstorming and for checking network connections when you want to be 100% sure you are not getting cached data.
So where does one come up with a list of, in this case, 3.5-4m websites? My site simply downloaded the whole DMOZ dump once a week, parsed out all the URLs, and built them into an SQL database. It worked very well... until now. DMOZ is dead!
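For the curious, the weekly job boiled down to something like the sketch below (SQLite, the table name, and the regex are placeholders here rather than my exact setup):

```python
# Minimal sketch: pull URLs out of a DMOZ RDF dump (content.rdf.u8.gz), load
# them into SQLite, then hand back a random one. Names/paths are illustrative.
import gzip
import re
import sqlite3

URL_RE = re.compile(rb'<ExternalPage about="([^"]+)"')

def load_dmoz_dump(dump_path, db_path):
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS sites (url TEXT PRIMARY KEY)")
    batch = []
    with gzip.open(dump_path, "rb") as fh:
        for line in fh:
            m = URL_RE.search(line)
            if m:
                batch.append((m.group(1).decode("utf-8", "replace"),))
            if len(batch) >= 10000:
                conn.executemany("INSERT OR IGNORE INTO sites VALUES (?)", batch)
                batch.clear()
    if batch:
        conn.executemany("INSERT OR IGNORE INTO sites VALUES (?)", batch)
    conn.commit()
    conn.close()

def random_site(db_path):
    # ORDER BY RANDOM() is fine at a few million rows.
    conn = sqlite3.connect(db_path)
    (url,) = conn.execute("SELECT url FROM sites ORDER BY RANDOM() LIMIT 1").fetchone()
    conn.close()
    return url

if __name__ == "__main__":
    load_dmoz_dump("content.rdf.u8.gz", "sites.db")
    print(random_site("sites.db"))
```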

So now I need a new way to aggregate a database of at least a few million websites of no one particular type.
Building my own spider is an interesting idea... but for now, I'd rather pass on that avenue and see if there's anything else out there that is regularly updated/maintained and that I can simply download as needed?

Is it kosher to post relevant links to our own sites on this forum or no? If so, I'll post a link so you can see how this thing works. If not, then just trust me that it's neat enough to put some effort into saving if I can.
6:07 am on May 7, 2017 (gmt 0)

Full Member

5+ Year Member

joined:Aug 16, 2010
posts:257
votes: 21


You can look at the Common Crawl Foundation. They crawl and publish 3 billion+ web pages every month.

[commoncrawl.org...]

I think you can download the URL index and use that?
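If it helps, something along these lines should list the per-crawl index files (the host and crawl ID below are my guesses, so check their site for the real layout):

```python
# Rough sketch: fetch a crawl's cc-index.paths.gz and print the URLs of its
# gzipped URL-index ("cdx") shards. Host, crawl ID and paths are assumptions.
import gzip
import urllib.request

BASE = "https://data.commoncrawl.org/"  # assumed public HTTP endpoint
CRAWL = "CC-MAIN-2017-17"               # example crawl ID

def list_index_files(crawl):
    paths_url = BASE + "crawl-data/" + crawl + "/cc-index.paths.gz"
    with urllib.request.urlopen(paths_url) as resp:
        listing = gzip.decompress(resp.read()).decode("utf-8")
    # Each line is a bucket-relative path to one gzipped index shard.
    return [BASE + line for line in listing.splitlines() if line.endswith(".gz")]

if __name__ == "__main__":
    for shard in list_index_files(CRAWL)[:5]:
        print(shard)
```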
6:09 am on May 7, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member keyplyr is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Sept 26, 2001
posts:12913
votes: 893


The dmoz archive is still available here: [archive.org...]

Is it kosher to post relevant links to our own sites on this forum or no?
Sorry, no personal links in this forum.
6:34 am on May 7, 2017 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2004
posts:142
votes: 0


That commoncrawl site looks very promising.
I know the DMOZ archive is still available, but it's not updated. I'm already running a URL database based on the last posted DMOZ archive, but it will eventually go stale and that's no good. Plus I'm liking the idea of going from a few million to a few billion sites in there :D
6:44 am on May 7, 2017 (gmt 0)

Moderator from US 

WebmasterWorld Administrator martinibuster is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Apr 13, 2002
posts:14912
votes: 491


No, no. Don't waste your time on Archive.org. There's a mirror here

http://dmoztools.net/ [dmoztools.net]

There is no replacement for DMOZ. There's nothing like it left. It was the last of its kind.
9:10 am on May 29, 2017 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2004
posts: 142
votes: 0


The commoncrawl index is very interesting. But after taking a close look at it, it's just too big. Downloading the smallest version of it uses more than 4x my server's total monthly bandwidth allotment. It's crazy big. I could potentially download it all at home, process it into a nice simple list of URLs, bzip it, and upload it to my server for insertion into an SQL database... but even doing that, I don't think I can download 9+ TB every month on my home internet connection without causing problems.

I suppose one solution would be to not download all the segments. I could download every 10th segment, or something like that. But if the database is in anything but totally random order, that's going to leave me with groups of some kinds of sites and holes of others. A better way to "thin out" the database would be to download everything and only add every 10th URL to *my* database, but that doesn't help me with the "I can't download 9 TB every month" problem. Hrmm
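If I do end up thinning it, hashing each URL and keeping roughly 1 in N seems safer than taking every Nth line or segment, since it doesn't care how the dump is ordered. Rough sketch:

```python
# Order-independent thinning: hash each URL and keep roughly 1 in N, instead
# of counting lines, so the sample isn't skewed by how the dump is sorted.
import hashlib

def keep_url(url, keep_one_in=10):
    digest = hashlib.md5(url.encode("utf-8")).digest()
    return int.from_bytes(digest[:4], "big") % keep_one_in == 0

def thin(urls, keep_one_in=10):
    # Works on any iterable, so it can run while streaming a dump.
    for url in urls:
        if keep_url(url, keep_one_in):
            yield url
```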
8:12 am on May 30, 2017 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2004
posts: 142
votes: 0


OK, update to that last post: you CAN download JUST a URL index. It's kind of hidden on their website, but it's only 201 GB, which is no problem to download. I'm testing out my script right now, but this seems like it should work as a good replacement.

And instead of having 3.5 million URLs, I'll have at least 250ish million, possibly many more depending on how much disk space I have. Woohoo
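In case anyone wants to try the same thing, the loader is roughly this shape (heavily simplified, and the index line format here is my assumption, so verify it against the files you actually download):

```python
# Simplified sketch of a "cdx index shard -> SQLite" loader. Assumes each index
# line looks like "urlkey timestamp {json with a 'url' field}".
import gzip
import json
import sqlite3
import urllib.request

def load_index_shard(shard_url, conn):
    conn.execute("CREATE TABLE IF NOT EXISTS sites (url TEXT PRIMARY KEY)")
    batch = []
    with urllib.request.urlopen(shard_url) as resp:
        with gzip.GzipFile(fileobj=resp) as fh:  # stream-decompress, no temp file
            for raw in fh:
                try:
                    record = json.loads(raw.decode("utf-8").split(" ", 2)[2])
                except (IndexError, ValueError):
                    continue  # skip malformed lines
                url = record.get("url")
                if url:
                    batch.append((url,))
                if len(batch) >= 50000:
                    conn.executemany("INSERT OR IGNORE INTO sites VALUES (?)", batch)
                    batch.clear()
    if batch:
        conn.executemany("INSERT OR IGNORE INTO sites VALUES (?)", batch)
    conn.commit()

if __name__ == "__main__":
    db = sqlite3.connect("sites.db")
    load_index_shard("https://data.commoncrawl.org/cc-index/collections/"
                     "CC-MAIN-2017-17/indexes/cdx-00000.gz", db)  # example shard URL
    db.close()
```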
8:46 am on May 30, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10122
votes: 1005


A solution! Hope it fills the need.

Just wondering why you don't spider yourself instead of relying on others' work? That way you are 100% independent and don't have to worry about another "data source" disappearing.
9:23 am on May 30, 2017 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2004
posts: 142
votes: 0


Because processing a URL dump is an order of magnitude easier than running your own web-crawling spider that builds its own database.

Or to put it another way, why reinvent the wheel when commoncrawl.org is giving away free bicycles :D
10:02 am on May 30, 2017 (gmt 0)

Senior Member from US 

WebmasterWorld Senior Member tangor is a WebmasterWorld Top Contributor of All Time 10+ Year Member Top Contributors Of The Month

joined:Nov 29, 2005
posts:10122
votes: 1005


But you do get what you pay for... and any expectation that it will continue to be either free or even available is... interesting?

DMOZ had a good long run. Given the web, and the rate at which "free" things in particular change, it might be iffy to bank on that for the long run.

(spiders are easy, just let 'em loose and sit back. What to do with the data is a different thing.)
10:09 am on May 30, 2017 (gmt 0)

Junior Member

10+ Year Member

joined:Nov 22, 2004
posts: 142
votes: 0


What to do with the data is not a different thing, though. It's all the same thing.
And when you parse someone else's data dump, you can just build up a new DB from scratch every time and then delete your old one, nice and easy. If you're maintaining a spider, it not only has to read HTML but also interpret it, find other links, start from a base set of websites to crawl, and always maintain a database of billions of URLs... I don't see the benefit. Especially when commoncrawl exists. A couple of days working on a processing script and I'm golden. Sure, one day they might close down, and if I can't find another similar solution, then I just might have to crawl on my own. But until then, it makes no sense to do that now when someone else already does it better than I ever could, and shares their results for free.
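By "build up a new DB from scratch" I mean something like this: build the fresh file off to the side, sanity-check it, then swap it in (just a sketch; the paths and the row-count threshold are made up):

```python
# Sketch of the rebuild-then-swap idea: verify the freshly built database looks
# sane, then atomically replace the live file. Paths/threshold are illustrative.
import os
import sqlite3

def swap_in_new_db(new_path="sites.new.db", live_path="sites.db"):
    conn = sqlite3.connect(new_path)
    (count,) = conn.execute("SELECT COUNT(*) FROM sites").fetchone()
    conn.close()
    if count < 1000000:
        raise RuntimeError("new database looks too small (%d rows)" % count)
    os.replace(new_path, live_path)  # atomic when both paths are on one filesystem
```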