Yahoo releases 13.5TB Webscope data set for machine learning researchers

Yahoo is today announcing the release of a large-scale data set that describes people's usage of news feeds on several of the company's web services, including Yahoo News and Yahoo Finance. The idea is to empower machine learning researchers in academia with very rich data.

The release of data is not, in and of itself, new for Yahoo -- there have been 56 previous releases in the Yahoo Labs Webscope program, which encompasses advertising, image, social, and ratings data, among other categories. This data set in particular covers 20 million people over the course of four months in 2015, and shows the types of devices people used to visit pages, how far down they got in the articles, and the top subjects of articles. There is data on people's locations, their ages (in some cases), and their gender -- all in an anonymized way.

What's interesting about today's release is the size of the data set: 13.5TB. That's far larger than the biggest data set available until this point, which came to about 1TB.

"Why am I excited? It's because I think these collaborations between the academic community and industry are crucial for research, design, and development of state-of-the-art artificial intelligence and machine learning techniques, to really handle big data in the real world out there," said Gert Lanckriet, a professor of electrical and computer engineering at the University of California, San Diego, during a Yahoo press event in San Francisco earlier this week.

This move casts Yahoo in a positive light, and is happening at a time when Yahoo could use some positive press.

It's been more than three years since Marissa Mayer became the company's CEO, and lately, investors have been pushing for corporate change. Last month, the company was said to be considering a spinoff of its core Internet properties. Last week, Yahoo was reported to be planning layoffs. This week, the New York Times reported a "brain drain" at the company. Mayer, for her part, just had twins.

But the company has also taken steps to please developers in recent weeks. It has open-sourced algorithms for running computations on streaming data and a web crawler specially designed for dealing with structured data on websites. Now comes this data release, which should be welcome in academic circles, especially among researchers who want to understand what people read and how they read it, and who are working to improve their algorithms.

A 100-row sample of the data that Yahoo provided to VentureBeat gives a hint of the variety of the data set, with articles on such subjects as stocks, schools, politics, sports, celebrities, and, randomly, lucha libre.

To be sure, Yahoo could have released even more data than this. Suju Rajan, director of research for personalization science at Yahoo Labs, said that she regularly works with petabyte-scale data, which is not unusual for a web company like Yahoo. But a data set of that size would be difficult for a single researcher to work with interactively. Even 13.5TB of data may be more than many researchers can handle.

"Many people will not be able to use it, but we think we can advance the research in that way," said Ricardo Baeza-Yates, Yahoo Labs' vice president of research.

A blog post has more detail on the news.
