

DMOZ Replacement Suggestions?

         

l008comm

5:33 am on May 7, 2017 (gmt 0)

10+ Year Member Top Contributors Of The Month



I'm looking for a replacement for the DMOZ directory. Does anyone have any suggestions?

But before you reply, understand how I was using it:
One of the tools on my site is a "Random Website" tool, that lets you essentially browse a 'random' website. It's a neat tool, useful for sparking brainstorming, and checking network connections where you want to be 100% sure you are not getting cached data.
Well, where does one come up with a list of, in this case, 3.5-4 million websites? My site simply downloaded the whole DMOZ dump once a week, parsed out all the URLs and built them into an SQL database. It worked very well... until now. DMOZ is dead!
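
For anyone curious what that weekly import looks like, here is a minimal sketch, not the actual script. It assumes the dump lists each site as an `<ExternalPage about="...">` element (as in the RDF-style content dump) and uses SQLite with a hypothetical `sites` table; the function names are made up for illustration.

```python
import html
import re
import sqlite3

# Assumption: each listed site appears as <ExternalPage about="URL"> in the dump.
EXTERNAL_PAGE = re.compile(r'<ExternalPage about="([^"]+)"')


def extract_urls(lines):
    """Yield the URL of every ExternalPage element found in the dump lines."""
    for line in lines:
        match = EXTERNAL_PAGE.search(line)
        if match:
            # Attribute values are XML-escaped (e.g. &amp;), so unescape them.
            yield html.unescape(match.group(1))


def load_urls(conn, urls):
    """Insert URLs into a hypothetical 'sites' table and return the row count."""
    conn.execute("CREATE TABLE IF NOT EXISTS sites (url TEXT PRIMARY KEY)")
    with conn:  # commit once for the whole batch
        conn.executemany(
            "INSERT OR IGNORE INTO sites (url) VALUES (?)",
            ((url,) for url in urls),
        )
    return conn.execute("SELECT COUNT(*) FROM sites").fetchone()[0]
```

In practice the lines would come from the decompressed dump file, for example by passing an open `gzip.open(path, "rt", encoding="utf-8")` handle to `extract_urls`.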

So now I need a new way to aggregate a database of at least a few million websites of no one particular type.
Building my own spider is an interesting idea... but for now, I'd rather pass on that avenue and see if there's anything else out there that is regularly updated/maintained, that I can simply download as needed?

Is it kosher to post relevant links to our own sites on this forum or no? If so I'll post a link so you can see how this thing works. If not, then just trust me it's neat enough to put some effort into saving it if I can.

bhukkel

6:07 am on May 7, 2017 (gmt 0)

10+ Year Member



You can look at the common crawl foundation. They crawl and publish 3 billion+ web pages every month.

[commoncrawl.org...]

I think you can download the url index and use that?

keyplyr

6:09 am on May 7, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



The dmoz archive is still available here: [archive.org...]

"Is it kosher to post relevant links to our own sites on this forum or no?"
Sorry, no personal links in this forum.

l008comm

6:34 am on May 7, 2017 (gmt 0)

10+ Year Member Top Contributors Of The Month



That commoncrawl site looks very promising.
I know the dmoz archive is still available, but it's not updated. I'm already running a url database based on the last posted dmoz archive. But it will eventually go stale and that's no good. Plus I'm liking the idea of going from a few million to a few billion sites in there :D

martinibuster

6:44 am on May 7, 2017 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



No, no. Don't waste your time on Archive.org. There's a mirror here

http://dmoztools.net/ [dmoztools.net]

There is no replacement for DMOZ. There's nothing like it left. It was the last of its kind.

l008comm

9:10 am on May 29, 2017 (gmt 0)

10+ Year Member Top Contributors Of The Month



The commoncrawl index is very interesting. But after taking a close look at it, it's just too big. Downloading the smallest version of it uses more than 4x my server's total monthly bandwidth allotment. It's crazy big. I could potentially download it all at home, process it into a nice simple list of urls, bzip it and upload it to my server for insertion into an sql database... but even doing that, I don't think I can download 9+ TB every month on my home internet connection without causing problems.

I suppose one solution would be to not download all the segments. I could download every 10th segment, or something like that. But if the database is in anything but totally random order, that's going to leave me with groups of some kinds of sites and holes of others. A better way to "thin out" the database would be to download everything, and only add every 10th url to *my* database, but that doesn't help me with the "I can't download 9 TB every month" problem. Hrmm
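
The "only add every 10th url" idea can be sketched in a few lines. This is an illustration only: `every_nth` and `random_thin` are hypothetical helper names, and the random variant is an alternative not mentioned above, included because it avoids any bias from how the list happens to be ordered.

```python
import random
from itertools import islice


def every_nth(urls, n=10):
    """Keep the 1st, (n+1)th, (2n+1)th, ... URL from a stream."""
    return islice(urls, 0, None, n)


def random_thin(urls, keep_fraction=0.1, seed=None):
    """Keep each URL independently with probability keep_fraction."""
    rng = random.Random(seed)
    for url in urls:
        if rng.random() < keep_fraction:
            yield url
```

Both work on a stream, so nothing has to be held in memory, but as noted both still require reading the whole list first.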

l008comm

8:12 am on May 30, 2017 (gmt 0)

10+ Year Member Top Contributors Of The Month



OK, update to that last post: you CAN download JUST a url index. It's kind of hidden on their website, but it's only 201 GB, which is no problem to download. I'm testing out my script right now but this seems like it should work as a good replacement.

And instead of having 3.5 million URLs, I'll have at least 250ish million, possibly many more depending on how much disk space I have. Woohoo
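
A rough sketch of what processing the url index could look like, not the actual script. It assumes each index line has the form `key timestamp {json}` with the JSON object carrying a `url` field, and it reduces every page URL to its site root so each website is listed once. `site_roots` is a made-up name.

```python
import json
from urllib.parse import urlsplit


def site_roots(index_lines):
    """Yield each distinct scheme://host/ root found in url index lines.

    Assumption: every line looks like '<key> <timestamp> <json>' and the
    JSON object has a "url" field. Lines that do not fit are skipped.
    """
    seen = set()
    for line in index_lines:
        parts = line.split(" ", 2)
        if len(parts) != 3:
            continue
        try:
            record = json.loads(parts[2])
        except json.JSONDecodeError:
            continue
        if not isinstance(record, dict) or not record.get("url"):
            continue
        try:
            split = urlsplit(record["url"])
        except ValueError:  # malformed URL
            continue
        if not split.netloc:
            continue
        root = f"{split.scheme}://{split.netloc}/"
        if root not in seen:
            seen.add(root)
            yield root
```

The in-memory `seen` set keeps the sketch short. At hundreds of millions of sites, letting the database enforce uniqueness is probably the more realistic choice.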

tangor

8:46 am on May 30, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



A solution! Hope it fills the need.

Just wondering why you don't spider yourself instead of relying on others' work? That way you are 100% independent and don't have to worry about another "data source" disappearing.

l008comm

9:23 am on May 30, 2017 (gmt 0)

10+ Year Member Top Contributors Of The Month



Because processing a URL dump is an order of magnitude easier than running your own web crawling spider that builds its own database.

Or to put it another way, why reinvent the wheel when commoncrawl.org is giving away free bicycles :D

tangor

10:02 am on May 30, 2017 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



But you do get what you pay for... and any expectation that it will continue to be either free or even available is ... interesting?

DMOZ had a good long run. Given the web, and the rate at which things change for "free" things in particular, it might be iffy to bank on that for the long run.

(spiders are easy, just let 'em loose and sit back. What to do with the data is a different thing.)

l008comm

10:09 am on May 30, 2017 (gmt 0)

10+ Year Member Top Contributors Of The Month



What to do with the data is not a different thing though. It's all the same thing.
And when you parse someone else's data dump, you can just build up a new db from scratch every time, then delete your old one, nice and easy. If you're maintaining a spider that not only has to read html, but interpret it, find other links, have a base set of websites to start crawling on, and always maintain a database of billions of URLs... I don't see the benefit. Especially when commoncrawl exists. A couple days working on a processing script and I'm golden. Sure, one day they might close down, and if I can't find another similar solution, then I just might have to crawl on my own. But until then, it just makes no sense to do that now when someone else already does it better than I ever could, and shares their results for free.
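
The "build a new db from scratch, then delete the old one" routine might look roughly like this. It is a sketch under assumptions: SQLite as the database, a hypothetical `sites` table, and made-up function names. The new file is built next to the live one and swapped in with a single rename, so the random-site tool never sees a half-built database.

```python
import os
import sqlite3


def rebuild_database(urls, live_path):
    """Build a fresh database beside the live one, then swap it into place."""
    tmp_path = live_path + ".new"
    if os.path.exists(tmp_path):
        os.remove(tmp_path)  # leftover from an interrupted run
    conn = sqlite3.connect(tmp_path)
    try:
        conn.execute(
            "CREATE TABLE sites (id INTEGER PRIMARY KEY, url TEXT UNIQUE)"
        )
        with conn:
            conn.executemany(
                "INSERT OR IGNORE INTO sites (url) VALUES (?)",
                ((url,) for url in urls),
            )
    finally:
        conn.close()
    # Replaces the old file in one step; the old data is gone afterwards.
    os.replace(tmp_path, live_path)


def random_site(live_path):
    """Return one randomly chosen URL, or None if the table is empty."""
    conn = sqlite3.connect(live_path)
    try:
        row = conn.execute(
            "SELECT url FROM sites ORDER BY RANDOM() LIMIT 1"
        ).fetchone()
    finally:
        conn.close()
    return row[0] if row else None
```

`ORDER BY RANDOM()` scans the whole table, which is fine for a sketch but slow at hundreds of millions of rows. Picking a random `id` in the known range would likely scale better.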