Yahoo open-sources CaffeOnSpark deep learning framework for Hadoop

Yahoo today is releasing some key artificial intelligence (AI) software under an open-source license. The company last year built a library called CaffeOnSpark to perform a popular type of AI called "deep learning" on the vast swaths of data kept in its Hadoop open-source file system for storing big data. Now the library is becoming available for anyone to use under an open-source Apache license on GitHub.

Written primarily in C++, CaffeOnSpark was developed to take advantage of the Spark data processing engine, which can outperform Hadoop for certain computations. Spark has the MLlib machine learning library, but it can't do deep learning, which involves training artificial neural nets on large quantities of data and then getting them to make inferences on new data. And like any other big technology company, Yahoo wants to be efficient with its time and its infrastructure.
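To make the train-then-infer idea concrete, here is a minimal sketch in plain Python. It is not CaffeOnSpark code and does not use Caffe or Spark: it trains a single artificial neuron, far smaller than any real deep network, on a made-up toy data set, then asks it about values it has not seen. The data set and the names `train` and `predict` are assumptions chosen for illustration.

```python
import math

# Hypothetical toy data: inputs between 0 and 1, labeled 1 when the input is above 0.5.
SAMPLES = [(x / 10.0, 1.0 if x > 5 else 0.0) for x in range(11) if x != 5]


def sigmoid(z):
    """Squash any real number into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))


def train(samples, epochs=5000, learning_rate=0.5):
    """Training: fit one neuron's weight and bias with gradient descent."""
    weight, bias = 0.0, 0.0
    n = len(samples)
    for _ in range(epochs):
        grad_w, grad_b = 0.0, 0.0
        for x, label in samples:
            error = sigmoid(weight * x + bias) - label
            grad_w += error * x
            grad_b += error
        # Step against the average gradient of the log loss.
        weight -= learning_rate * grad_w / n
        bias -= learning_rate * grad_b / n
    return weight, bias


def predict(model, x):
    """Inference: apply the trained neuron to a new input."""
    weight, bias = model
    return sigmoid(weight * x + bias)


if __name__ == "__main__":
    model = train(SAMPLES)
    for new_x in (0.05, 0.95):
        print(new_x, round(predict(model, new_x), 3))
```

Real deep learning stacks many layers of such units and trains them on far larger data sets, which is why the work gets spread across clusters of servers.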

"You don't have to set up separate deep learning clusters -- you can run deep learning where your data is today," Sumeet Singh, Yahoo's senior director of product management for cloud and big data platforms, told VentureBeat in an interview. "You don't have to copy data back and forth between these clusters for this specialized model training." As a result, engineers could combine deep learning with more traditional machine learning approaches.

Baidu, Facebook, Google, and Twitter have all open-sourced deep learning software in the past. This gets people outside these companies collaborating and making the tools better, and it can even lead to finding new talent to hire. Certain companies have rallied around certain stacks. For example, both Facebook and Twitter are using the Torch open-source deep learning framework. Google and Pinterest have used Caffe.

Yahoo has made many open-source contributions over the years, and Hadoop was actually born at Yahoo. More recently, Yahoo has open-sourced the Anthelion web crawler and the Data Sketches counting algorithms. It also released a 13TB data set for machine learning researchers in academia.

CaffeOnSpark is not the only tool of its kind. Startup Skymind offers DL4J (it stands for deep learning for Java), an open-source library for doing deep learning on Hadoop, but Andy Feng, vice president of architecture at Yahoo, told VentureBeat it didn't quite meet Yahoo's needs. Besides, the IQ Engines team that Yahoo acquired in 2013 was already familiar with Caffe, Feng said.

CaffeOnSpark supports deployments on generic x86 chips or graphics processing units (GPUs). It can be run on cloud infrastructure or companies' on-premises data centers. Servers running this distributed software can be wired up over Ethernet or faster InfiniBand. It's available as a third-party Spark package.

Internally at Yahoo, the software -- which Yahoo first talked about in a Tumblr post in September -- has been used for Flickr, spam detection, account security, and content recommendation.

See Feng's blog post for more detail. Documentation for CaffeOnSpark is here.