Microsoft and Nvidia partner to build AI supercomputer in the cloud

A supercomputer, providing massive amounts of computing power to tackle complex challenges, is typically out of reach for the average enterprise data scientist. However, what if you could use cloud resources instead? That's the rationale that Microsoft Azure and Nvidia are taking with this week’s announcement designed to coincide with the SC22 supercomputing conference.

Nvidia and Microsoft announced that they are building a “massive cloud AI computer." The supercomputer in question, however, is not an individually-named system, like the Frontier system at the Oak Ridge National Laboratory or the Perlmutter system, which is the world's fastest Artificial Intelligence (AI) supercomputer. Rather, the new AI supercomputer is a set of capabilities and services within Azure, powered by Nvidia technologies, for high performance computing (HPC) uses.

"There's widespread adoption of AI in enterprises across a full range of use cases, and so addressing this demand requires really powerful cloud AI computing instances," Paresh Kharya, senior director for accelerated computing at Nvidia, told VentureBeat. "Our collaboration with Microsoft enables us to provide a very compelling solution for enterprises that are looking to create and deploy AI at scale to transform their businesses.”

The hardware going into the Microsoft Azure AI supercomputer

Microsoft is hardly a stranger to Nvidia's AI acceleration technology, which is already in use by large organizations.

In fact, Kharya noted that Microsoft's Bing makes use of Nvidia-powered instances to help accelerate search, while Microsoft Teams uses Nvidia GPUs to help convert speech-to-text.

Nidhi Chappell, partner/GM of specialized compute at Microsoft, explained to VentureBeat that Azure AI-optimized virtual machine (VM) offerings, like the current generation of the NDm A100 v4 VM series, start with a single virtual machine (VM) and eight Nvidia Ampere A100 Tensor Core GPUs.

"But just like the human brain is composed of interconnected neurons, our NDm A100 v4-based clusters can scale up to thousands of GPUs with an unprecedented 1.6 Tb/s of interconnect bandwidth per VM," Chappell said. "Tens, hundreds, or thousands of GPUs can then work together as part of an InfiniBand cluster to achieve any level of AI ambition."

What's new is that Nvidia and Microsoft are doubling down on their partnership, with even more powerful AI capabilities.

Kharya said that as part of the renewed collaboration, Microsoft will be adding the new Nvidia H100 GPUs to Azure. Additionally, Azure will be upgrading to Nvidia's next-generation Quantum 2 InfiniBand, which doubles the available bandwidth to 400 Gigabits per second (Gb/s). (The current generation of Azure instances rely on the 200 Gb/s Quantum InfiniBand technology.)

Microsoft DeepSpeed getting a hopper boost

The Microsoft-Nvidia partnership isn't just about hardware. It also has a very strong software component.

The two vendors have already worked together using Microsoft's DeepSpeed deep learning optimization software to help train the Nvidia Megatron-Turing Natural Language Generation (MT-NLG) Large Language Model.

Chappell said that as part of the renewed collaboration, the companies will optimize Microsoft’s DeepSpeed with the Nvidia H100 to accelerate transformer-based models that are used for large language models, generative AI and writing computer code, among other applications.

"This technology applies 8-bit floating point precision capabilities to DeepSpeed to dramatically accelerate AI calculations for transformers — at twice the throughput of 16-bit operations," Chappell said.

AI cloud supercomputer for generative AI research

Nvidia will now also be using Azure to help with its own research into generative AI capabilities.

Kharya noted that a number of generative AI models for creating interesting content, have recently emerged, such as Stable Diffusion. He said that Nvidia is working on its own approach, called eDiff-I, to generate images from text prompts.

"Researching AI requires large-scale computing — you need to be able to use thousands of GPUs that are connected by the highest bandwidth, low latency networking, and have a really high performance software stack that's making all of this infrastructure work," Kharya said. "So this partnership expands our ability to train and to provide computing resources to our research [and] software development teams to create generative AI models, as well as offer services to our customers."

The hardware going into the Microsoft Azure AI supercomputer

Microsoft DeepSpeed getting a hopper boost

AI cloud supercomputer for generative AI research

More