Yahoo Labs Releases 13.5TB of Machine Learning Dataset For Researchers

         

engine

2:28 pm on Jan 15, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



If you're in academia, this dataset release should help you evaluate your models against real-world data.

Today, we are proud to announce the public release of the largest-ever machine learning dataset to the research community. The dataset stands at a massive ~110B events (13.5TB uncompressed) of anonymized user-news item interaction data, collected by recording the user-news item interactions of about 20M users from February 2015 to May 2015.

The Yahoo News Feed dataset is a collection based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate.
Yahoo Labs Releases 13.5TB of Machine Learning Dataset For Researchers [yahoolabs.tumblr.com]

tangor

3:58 pm on Jan 15, 2016 (gmt 0)

WebmasterWorld Senior Member 10+ Year Member Top Contributors Of The Month



"Hey, Mort! We need to order some more floppies!"

That's a chunk of data, whew! Four months of data... what would the full year look like? (And what kind of machines are best suited?)
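On the full-year question, here is a rough back-of-the-envelope sketch using only the figures quoted in the announcement. It assumes decimal terabytes, a constant collection rate across the year, and the same sample of users, none of which Yahoo states. The helper name `summarize` is made up for illustration.

```python
# Figures quoted in the announcement; everything derived from them is an estimate.
EVENTS = 110e9        # ~110B events
SIZE_TB = 13.5        # uncompressed size
USERS = 20e6          # about 20M users
MONTHS_COVERED = 4    # February 2015 to May 2015


def summarize(events=EVENTS, size_tb=SIZE_TB, users=USERS, months=MONTHS_COVERED):
    """Return rough per-event, per-user, and per-year estimates."""
    bytes_total = size_tb * 1e12  # assumption: decimal terabytes
    return {
        "bytes_per_event": bytes_total / events,
        "events_per_user": events / users,
        # assumption: collection rate stays constant for twelve months
        "tb_per_year": size_tb / months * 12,
    }


if __name__ == "__main__":
    for name, value in summarize().items():
        print(f"{name}: {value:,.1f}")
```

Under those assumptions it works out to roughly 40 TB uncompressed for a full year, about 123 bytes per event, and about 5,500 events per user over the four months.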

engine

4:31 pm on Jan 15, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



hehehe, I think we can guess some of what it says.

Academia cannot usually get such huge data dumps, so whatever it says is unimportant, imho.

creeking

8:25 pm on Jan 15, 2016 (gmt 0)

10+ Year Member



Remember the release of the AOL search data?

Anyone think someone could be identified by their "user-news" activity?

Robert Charlton

10:41 am on Jan 17, 2016 (gmt 0)

WebmasterWorld Administrator 10+ Year Member Top Contributors Of The Month



Yes, creeking, I do remember that vividly.

That said, I truly don't know whether the situations (and the data) are in any way analogous.