Yahoo Releases Machine Learning Dataset for Researchers

Yahoo has released what it describes as the largest-ever machine learning dataset to the academic research community, with the aim of advancing large-scale machine learning and recommender systems. "Many academic researchers and data scientists don't have access to truly large-scale datasets because it is traditionally a privilege reserved for large companies," said Suju Rajan, director of research at Yahoo Labs. "We are releasing this dataset for independent researchers because we value open and collaborative relationships with our academic colleagues, and are always looking to advance the state-of-the-art in machine learning and recommender systems."

The Yahoo News Feed dataset is based on a sample of anonymized user interactions on the news feeds of several Yahoo properties, including the Yahoo homepage, Yahoo News, Yahoo Sports, Yahoo Finance, Yahoo Movies, and Yahoo Real Estate. It contains roughly 110 billion events (13.5 TB uncompressed) of user-news item interaction data, recorded from about 20 million users between February 2015 and May 2015.
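To put those figures in perspective, a rough back-of-the-envelope calculation gives the average size of an event and the average number of events per user. The sketch below uses only the approximate numbers quoted above and assumes decimal terabytes (1 TB = 10^12 bytes); the variable and function names are illustrative and are not part of the dataset or its documentation.

```python
# Approximate figures quoted in the article.
TOTAL_EVENTS = 110e9   # ~110 billion events
TOTAL_BYTES = 13.5e12  # 13.5 TB uncompressed, assuming decimal terabytes
TOTAL_USERS = 20e6     # ~20 million users


def dataset_averages(events=TOTAL_EVENTS, size_bytes=TOTAL_BYTES, users=TOTAL_USERS):
    """Return (average bytes per event, average events per user)."""
    return size_bytes / events, events / users


bytes_per_event, events_per_user = dataset_averages()
print(f"~{bytes_per_event:.0f} bytes per event")
print(f"~{events_per_user:.0f} events per user")
```

Under these assumptions, each event averages a little over 120 bytes uncompressed, and each user contributes several thousand events over the collection period. These are averages only; the article does not describe how events are distributed across users.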

The dataset provides categorized demographic information (age range, gender, and generalized geographic data) for a subset of the anonymized users. On the item side, it includes the title, summary, and key phrases of each news article. Interaction data is timestamped with the user's local time and contains partial information about the device used to access the news feed.

The dataset is available through the Yahoo Labs Webscope data-sharing program, a reference library of scientifically useful datasets composed of anonymized user data for non-commercial use.