Microsoft's ZeRO-2 with DeepSpeed trains neural networks with up to 170 billion parameters

Microsoft today upgraded DeepSpeed, its library for training large neural networks, with ZeRO-2. Microsoft says the memory-optimizing technology is capable of training machine learning models with 170 billion parameters. For context, Nvidia's massive Megatron language model is one of the biggest in the world today at 11 billion parameters.

Today's announcement follows the February open source release of the DeepSpeed library, which was used to create Turing-NLG. At 17 billion parameters, Turing-NLG is the largest known language model in the world today. Microsoft introduced the Zero Redundancy Optimizer (ZeRO) in February alongside DeepSpeed.

ZeRO achieves its results by reducing memory redundancy in data parallelism, a training technique in which every device would otherwise hold a full copy of the model states. Whereas ZeRO-1 included some model state memory optimization, ZeRO-2 delivers optimization for activation memory and fragmented memory.
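
To make the redundancy idea concrete, here is a rough back-of-the-envelope sketch of per-GPU model state memory under data parallelism. The byte counts are assumptions taken from the accounting commonly given for mixed-precision Adam training (2 bytes per parameter for fp16 weights, 2 for fp16 gradients, 12 for optimizer states) and are not figures from this article. The function name and the stage numbering are illustrative only, and the sketch does not model activation memory or fragmentation.

```python
def model_state_bytes_per_gpu(num_params, num_gpus, stage):
    """Estimate model state memory per GPU, in bytes.

    Assumed (not from the article): mixed-precision Adam with
    2 bytes/param for fp16 weights, 2 bytes/param for fp16 gradients,
    and 12 bytes/param for optimizer states.

    stage 0: plain data parallelism, everything replicated on each GPU
    stage 1: optimizer states partitioned across GPUs
    stage 2: optimizer states and gradients partitioned
    stage 3: optimizer states, gradients, and parameters partitioned
    """
    if stage not in (0, 1, 2, 3):
        raise ValueError("stage must be 0, 1, 2, or 3")
    if num_gpus < 1:
        raise ValueError("num_gpus must be at least 1")

    params = 2 * num_params   # fp16 weights
    grads = 2 * num_params    # fp16 gradients
    optim = 12 * num_params   # fp32 weights, momentum, variance

    # Each stage removes one more replicated copy by splitting it
    # evenly across the data-parallel GPUs.
    if stage >= 1:
        optim /= num_gpus
    if stage >= 2:
        grads /= num_gpus
    if stage >= 3:
        params /= num_gpus

    return params + grads + optim


if __name__ == "__main__":
    # Hypothetical setup: a 17-billion-parameter model on 64 GPUs.
    for stage in range(4):
        gib = model_state_bytes_per_gpu(17e9, 64, stage) / 2**30
        print(f"stage {stage}: {gib:.1f} GiB of model states per GPU")
```

Under these assumptions, the replicated copies dominate memory at stage 0, and each additional partitioning step shrinks the per-GPU footprint as more GPUs are added.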

DeepSpeed is made for distributed model training across multiple servers, but ZeRO-2 also comes with improvements for training models on a single GPU, reportedly training models like Google's BERT 30% faster.

Additional details will be announced Wednesday in a keynote address by Microsoft CTO Kevin Scott.

The news comes at the start of Microsoft's all-digital Build developer conference, where a number of AI developments have been announced -- including the debut of the WhiteNoise toolkit for differential privacy in machine learning and Project Bonsai for industrial applications of AI.

Last week, Nvidia CEO Jensen Huang unveiled the Ampere GPU architecture and A100 GPU. The new GPU -- alongside trends like the creation of multimodal models and massive recommender systems -- is expected to lead to larger machine learning models in the years ahead.
