Google launches TensorFlow 0.8 with support for distributed model training

Google today is announcing the release of version 0.8 of TensorFlow, its open-source machine learning software. The release is significant because it makes it possible to train machine learning models across multiple machines rather than on a single one.

TensorFlow can be used for a type of artificial intelligence called deep learning, which involves training artificial neural networks on lots of data and then getting them to make inferences about new data. Training is a crucial step in the process.

With more than 1 million servers at the company's disposal, Googlers love to scale out their software across many servers at once, balancing the work so it gets done more quickly and efficiently. But when TensorFlow was released to the public in November, it didn't support distributed training. Within 24 hours, people had raised the gap in a GitHub issue.

"Our current internal distributed extensions are somewhat entangled with Google internal infrastructure, which is why we released the single-machine version first," Google senior fellow Jeff Dean wrote in reply. "The code is not yet in GitHub, because it has dependencies on other parts of the Google code base at the moment, most of which have been trimmed, but there are some remaining ones.

"We realize that distributed support is really important, and it's one of the top features we're prioritizing at the moment."

Now, five months later, Google has solved that problem.

The addition matters because other machine learning software already works across multiple machines. For example, while the Caffe deep learning framework can't be used for distributed training on its own, Yahoo made it work on top of Hadoop, the open-source big data software, using the Spark data processing engine. Deeplearning4j can handle distributed training, as can Microsoft's CNTK. But Theano, another popular framework (among many others available), can't.

By responding to the community in a timely fashion, Google can get more people to improve its technology and build more software with it. Google -- Jeff Dean himself -- has been boasting about the open-source community's excitement about the project, and now Google can also boast about how people can train using many machines at once when they choose TensorFlow.

"Even small clusters benefit from distributed TensorFlow, since adding more GPUs (graphics processing units) improves the overall throughput, and produces accurate results sooner," Google Brain software engineer Derek Murray wrote in a blog post.

In addition to distributed support, the 0.8 release comes with a distributed trainer for Google's Inception neural network, along with code for defining how distributed models should work.
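
For readers curious what training across machines looks like in practice, here is a minimal sketch of one common approach, synchronous data-parallel training, written in plain Python. It is not TensorFlow code and does not use the 0.8 release's API. The "workers" are simulated inside one process, the toy model is a straight-line fit, and every function name and parameter is hypothetical, chosen only for illustration.

```python
# Illustrative sketch of synchronous data-parallel training.
# NOT TensorFlow's API: workers are simulated in a single process and
# all names (shard_gradient, train, num_workers, ...) are hypothetical.

def shard_gradient(w, b, shard):
    """Gradient of mean squared error for y ~ w*x + b on one worker's shard."""
    n = len(shard)
    grad_w = sum(2 * (w * x + b - y) * x for x, y in shard) / n
    grad_b = sum(2 * (w * x + b - y) for x, y in shard) / n
    return grad_w, grad_b


def train(data, num_workers=4, steps=2000, lr=0.05):
    # Split the data set so each worker sees a different slice.
    shards = [data[i::num_workers] for i in range(num_workers)]
    w, b = 0.0, 0.0
    for _ in range(steps):
        # On a real cluster these gradient computations would run in parallel.
        grads = [shard_gradient(w, b, shard) for shard in shards]
        # A coordinating step averages the gradients and updates the shared model.
        w -= lr * sum(g[0] for g in grads) / num_workers
        b -= lr * sum(g[1] for g in grads) / num_workers
    return w, b


# Toy data drawn from the line y = 3x + 1.
data = [(i / 10.0, 3.0 * (i / 10.0) + 1.0) for i in range(20)]
w, b = train(data)
print(round(w, 3), round(b, 3))
```

Because each simulated worker handles only a quarter of the data per step, the per-step work shrinks as workers are added, which is the throughput gain Murray describes. A real system also has to move gradients and model parameters between machines, which this sketch leaves out.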

In February, Google released TensorFlow Serving, software for scaling out inference.
