If you’re in the business of training large-scale AI systems, good news: Google’s got your back. Google’s AI research division today open-sourced GPipe, a library for “efficiently” training deep neural networks (layered functions modeled after neurons) under Lingvo, a TensorFlow framework for sequence modeling. It’s applicable to any network consisting of multiple sequential layers, Google AI software engineer Yanping Huang said in a blog post, and allows researchers to “easily” scale performance.
“Deep neural networks (DNNs) have advanced many machine learning tasks, including speech recognition, visual recognition, and language processing. [E]ver-larger DNN models lead to better task performance and past progress in visual recognition tasks has also shown a strong correlation between the model size and classification accuracy,” he added. “[In] GPipe … we demonstrate the use of pipeline parallelism to scale up DNN training to overcome this limitation.”
As Huang and colleagues explain in an accompanying paper (“GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism”), GPipe implements two nifty AI training techniques. One is synchronous stochastic gradient descent, an optimization algorithm used to update a given AI model’s parameters, and the other is pipeline parallelism, a task execution system in which one step’s output is streamed as input to the next step.
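The payoff of pipelining is easiest to see in a toy schedule. The sketch below (not Google’s implementation; the stage and micro-batch counts are illustrative) shows which pipeline stage processes which micro-batch at each clock step: with K stages and M micro-batches, the forward pass finishes in M + K − 1 steps instead of M × K fully sequential ones.

```python
# Hedged sketch: a GPipe-style forward pipeline schedule.
# Stage/micro-batch counts are illustrative, not from the paper.

def pipeline_schedule(num_stages, num_microbatches):
    """Return, for each clock step, the (stage, microbatch) pairs running in parallel.

    Stage k can process micro-batch m at step m + k, so the forward pass
    takes num_microbatches + num_stages - 1 steps in total.
    """
    steps = []
    for t in range(num_stages + num_microbatches - 1):
        active = [(k, t - k) for k in range(num_stages)
                  if 0 <= t - k < num_microbatches]
        steps.append(active)
    return steps

schedule = pipeline_schedule(num_stages=4, num_microbatches=8)
print(len(schedule))  # 11 steps, versus 32 if stages ran one micro-batch at a time
```

After a short “bubble” while the pipeline fills, all four stages stay busy simultaneously, which is where the parallel speedup comes from.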
Most of GPipe’s performance gains come from better memory allocation for AI models. On second-generation Google Cloud tensor processing units (TPUs), each of which contains eight processor cores and 64 GB of memory (8 GB per core), GPipe reduced intermediate memory usage from 6.26 GB to 3.46 GB, enabling the training of models with up to 318 million parameters on a single accelerator core. Without GPipe, Huang says, a single core can only train up to 82 million model parameters.
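A quick back-of-envelope check puts those figures in perspective. All the values below come from the article itself; the script just computes the implied savings and capacity gain.

```python
# Sanity check on the reported per-core numbers (all inputs from the article).
intermediate_without_gb = 6.26   # activation memory without GPipe
intermediate_with_gb = 3.46      # activation memory with GPipe
freed_gb = intermediate_without_gb - intermediate_with_gb
print(f"{freed_gb:.2f} GB of the 8 GB per core freed for model parameters")

params_without = 82e6    # max trainable parameters per core without GPipe
params_with = 318e6      # max trainable parameters per core with GPipe
ratio = params_with / params_without
print(f"{ratio:.1f}x more parameters per core")
```

So cutting intermediate memory roughly in half translates into nearly four times as many trainable parameters on the same hardware.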
That’s not GPipe’s only advantage. It partitions models across different accelerators and automatically splits each mini-batch of training examples into smaller “micro-batches,” then pipelines execution across those micro-batches. This lets the cores operate in parallel, and because gradients are accumulated across the micro-batches, the number of partitions does not affect model quality.
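The reason quality is unaffected is that accumulating per-micro-batch gradients reproduces the full mini-batch gradient exactly. The toy example below (a least-squares loss on random data; names and sizes are purely illustrative) demonstrates the equivalence.

```python
# Hedged sketch: gradient accumulation over micro-batches equals the
# full mini-batch gradient. Toy mean-squared-error loss; data is random.
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 4))   # one mini-batch of 32 examples, 4 features
y = rng.normal(size=32)
w = rng.normal(size=4)

def grad(Xb, yb, w):
    # Gradient of 0.5 * mean((Xb @ w - yb) ** 2) with respect to w
    return Xb.T @ (Xb @ w - yb) / len(yb)

full = grad(X, y, w)

# Split the mini-batch into 4 micro-batches of 8 examples each and
# accumulate, weighting each micro-gradient by its share of the batch.
micro = sum(grad(X[i:i + 8], y[i:i + 8], w) * (8 / 32)
            for i in range(0, 32, 8))

print(np.allclose(full, micro))  # True
```

Because the accumulation is a weighted sum, splitting the batch more finely (to fill a deeper pipeline) changes the schedule but not the update the optimizer sees.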
In one experiment, Google trained a deep learning algorithm — AmoebaNet-B — with 557 million model parameters on sample images using TPUs; GPipe made it possible to incorporate up to 1.8 billion parameters on each TPU, 25 times more than is possible without it. The model performed “well” on popular datasets, Huang says, pushing single-crop ImageNet accuracy to 84.3 percent, CIFAR-10 accuracy to 99 percent, and CIFAR-100 accuracy to 91.3 percent.
Training speed improved, too. In a separate test involving the AmoebaNet-D algorithm, distributing the model across four times the number of second-gen TPU cores achieved a speedup of 3.5 times. And when Google researchers tested Transformer language models with eight billion parameters on third-generation TPUs (the newest available), each of which has 16 cores and 256 GB of memory (16 GB per core), they recorded a speedup of 11 times.
“The ongoing development and success of many practical machine learning applications, such as autonomous driving and medical imaging, depend on achieving the highest accuracy possible,” Huang wrote. “As this often requires building larger and even more complex models, we are happy to provide GPipe to the broader research community, and hope it is a useful infrastructure for efficient training of large-scale DNNs.”