Microsoft's updated DeepSpeed can train trillion-parameter AI models with fewer GPUs

Microsoft today released an updated version of its DeepSpeed library that introduces a new approach to training AI models containing trillions of parameters, the variables internal to the model that inform its predictions. The company claims the technique, dubbed 3D parallelism, adapts to the varying needs of workload requirements to power extremely large models while balancing scaling efficiency.

Single massive AI models with billions of parameters have made great strides in a range of challenging domains. Studies show they perform well because they can absorb the nuances of language, grammar, knowledge, concepts, and context, enabling them to summarize speeches, moderate content in live gaming chats, parse complex legal documents, and even generate code by scouring GitHub. But training the models requires enormous computational resources. According to a 2018 OpenAI analysis, from 2012 to 2018 the amount of compute used in the largest AI training runs grew more than 300,000 times with a 3.5-month doubling time, far exceeding the pace of Moore's law.

The enhanced DeepSpeed leverages three techniques to enable "trillion-scale" model training: data parallel training, model parallel training, and pipeline parallel training. Training a trillion-parameter model would require the combined memory of at least 400 Nvidia A100 GPUs (which have 40GB of memory each), and Microsoft estimates it would take 4,000 A100s running at 50% efficiency for about 100 days to complete the training. That is no match for the AI supercomputer Microsoft co-designed with OpenAI, which contains over 10,000 graphics cards, but attaining high computing efficiency tends to be difficult at that scale.

DeepSpeed divides large models into smaller components (layers) between four pipeline stages. Layers within each pipeline stage are further partitioned between four "workers," which perform the actual training. Each pipeline is replicated across two data-parallel instances, and the workers are mapped to multi-GPU systems. Thanks to these and other performance improvements, Microsoft says a trillion-parameter model could be scaled across as few as 800 Nvidia V100 GPUs.

The latest release of DeepSpeed also ships with ZeRO-Offload, a technology that exploits computational and memory resources on both GPUs and their host CPUs to allow training up to 13-billion-parameter models on a single V100. Microsoft claims that is 10 times larger than the state-of-the-art, making training accessible to data scientists with fewer computing resources.

"These [new techniques in DeepSpeed] offer extreme compute, memory, and communication efficiency, and they power model training with billions to trillions of parameters," Microsoft wrote in a blog post. "The technologies also allow for extremely long input sequences and power on hardware systems with a single GPU, high-end clusters with thousands of GPUs, or low-end clusters with very slow ethernet networks ... We [continue] to innovate at a fast rate, pushing the boundaries of speed and scale for deep learning training."

More