Nvidia will disclose Grace Hopper architectural details at Hot Chips

Nvidia engineers will deliver four technical presentations at next week's virtual Hot Chips conference, covering the Grace central processing unit (CPU), Hopper graphics processing unit (GPU), Orin system-on-chip (SoC), and NVLink Network Switch.

Together, they reflect the company's plan to build high-end data center infrastructure with a full stack of chips, hardware and software.

The presentations will share new details on Nvidia's platforms for artificial intelligence (AI), edge computing, and high-performance computing, said Dave Salvator, director of product marketing for AI inference, benchmarking and cloud at Nvidia, in an interview with VentureBeat.

If there is one trend common to all the talks, it is how widely accelerated computing has been adopted over the past few years in the design of modern data centers and systems at the edge of the network, Salvator said. CPUs are no longer expected to do all of the heavy lifting themselves.

The Hot Chips event

Regarding Hot Chips, Salvator said, "Historically, it's been a show where architects come together with architects to have a collegial environment, even though they're competitors. In years past, the show has had a tendency towards being a little CPU-centric with an occasional accelerator. But I think the interesting trendline, particularly from looking at the advance program that's already been published on the Hot Chips website, is you're seeing a lot more accelerators. It's certainly from us, but also from others. And I think it's just a recognition that you know, that these accelerators are absolute game changers for the data center. That's a macro trend that I think we've been seeing."

He added, "I would posit that I think we've made probably the most significant progress in that regard. It's a combination of things, right? It's not just the GPUs happen to be good at something. It's a huge amount of concerted work that we've been doing, really for over a decade, to get ourselves to where we are today."

Speaking at a virtual Hot Chips event (normally held at Silicon Valley college campuses), Nvidia will address the annual gathering of processor and system architects. They’ll disclose performance numbers and other technical details for Nvidia’s first server CPU, the Hopper GPU, the latest version of the NVSwitch interconnect chip and the Nvidia Jetson Orin system-on-module (SoM).

The presentations provide fresh insights on how the Nvidia platform will hit new levels of performance, efficiency, scale and security.

Specifically, the talks demonstrate a design philosophy of innovating across the full stack of chips, systems and software where GPUs, CPUs and DPUs act as peer processors, Salvator said. Together they create a platform that’s already running AI, data analytics and high-performance computing jobs at cloud service providers, supercomputing centers, corporate data centers and autonomous systems.

Inside the Nvidia server CPU

Data centers require flexible clusters of CPUs, GPUs and other accelerators sharing massive pools of memory to deliver the energy-efficient performance today’s workloads demand.

Nvidia Grace CPU is the first data center CPU developed by Nvidia, built from the ground up to create the world’s first superchips.

Jonathon Evans, a distinguished engineer and 15-year veteran at Nvidia, will describe the Nvidia NVLink-C2C. It connects CPUs and GPUs at 900 gigabytes per second with five times the energy efficiency of the existing PCIe Gen 5 standard, thanks to data transfers that consume just 1.3 picojoules per bit.
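As a rough sanity check on those figures, the short Python sketch below works out what 1.3 picojoules per bit means at the full 900 gigabytes per second. It assumes decimal gigabytes (10^9 bytes) and takes the "five times" efficiency claim at face value to back out an implied PCIe Gen 5 energy cost. The function and variable names are illustrative only and do not come from Nvidia.

```python
# Back-of-the-envelope link power from the figures quoted in the article.
BANDWIDTH_GB_PER_S = 900      # NVLink-C2C bandwidth, from the article
ENERGY_PJ_PER_BIT = 1.3       # NVLink-C2C transfer energy, from the article
EFFICIENCY_ADVANTAGE = 5      # "five times" PCIe Gen 5, from the article


def link_power_watts(gigabytes_per_s: float, picojoules_per_bit: float) -> float:
    """Power needed to move data at the given rate and energy cost per bit."""
    bits_per_s = gigabytes_per_s * 1e9 * 8      # assumes 1 GB = 10^9 bytes
    return bits_per_s * picojoules_per_bit * 1e-12


c2c_watts = link_power_watts(BANDWIDTH_GB_PER_S, ENERGY_PJ_PER_BIT)

# Energy per bit implied for PCIe Gen 5 if the 5x claim is taken literally.
implied_pcie_pj_per_bit = ENERGY_PJ_PER_BIT * EFFICIENCY_ADVANTAGE

print(f"NVLink-C2C at full rate: about {c2c_watts:.1f} W")
print(f"Implied PCIe Gen 5 cost: about {implied_pcie_pj_per_bit:.1f} pJ/bit")
```

Under these assumptions the link itself draws on the order of 10 watts when saturated, a small slice of the power budget of the chips it connects.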

NVLink-C2C connects two CPU chips to create the Nvidia Grace CPU Superchip with 144 Arm Neoverse cores. It’s a processor built to solve the world’s largest computing problems. Nvidia is using standard Arm cores because it didn't want to create custom instructions that could make programming more complex.

For maximum efficiency, the Grace CPU uses LPDDR5X memory. It enables a terabyte per second of memory bandwidth while keeping power consumption for the entire complex to 500 watts.

Nvidia designed Grace to deliver performance and energy efficiency to meet the demands of modern data center workloads powering digital twins, cloud gaming and graphics, AI, and high-performance computing (HPC). A single Grace CPU features 72 Arm v9.0 CPU cores that implement the Arm Scalable Vector Extension version 2 (SVE2) instruction set. The cores also incorporate virtualization extensions with nested virtualization capability and S-EL2 support.

Nvidia Grace CPU is also compliant with the following Arm specifications: RAS v1.1; Generic Interrupt Controller (GIC) v4.1; Memory Partitioning and Monitoring (MPAM); and System Memory Management Unit (SMMU) v3.1.

Grace CPU was built to pair either with the Nvidia Hopper GPU to create the Nvidia Grace Hopper Superchip for large-scale AI training, inference, and HPC, or with another Grace CPU to build the Grace CPU Superchip, a high-performance CPU to meet the needs of HPC and cloud computing workloads.

One NVLink

NVLink-C2C also links Grace CPU and Hopper GPU chips as memory-sharing peers in the Nvidia Grace Hopper Superchip, combining two separate chips in one module. It enables maximum acceleration for performance-hungry jobs such as AI training.

Anyone can build custom chiplets (or chip subcomponents) using NVLink-C2C to coherently connect to Nvidia GPUs, CPUs, DPUs (data processing units) and SoCs, expanding this new class of integrated products. The interconnect will support AMBA CHI and CXL protocols used by Arm and x86 processors, respectively.

To scale at the system level, the new Nvidia NVSwitch connects multiple servers into one AI supercomputer. It uses NVLink interconnects running at 900 gigabytes per second, more than seven times the bandwidth of PCIe Gen 5.
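The "more than seven times" comparison can be reproduced with a few lines of Python. The PCIe Gen 5 parameters below (32 gigatransfers per second per lane, 128b/130b encoding, a 16-lane link) are not stated in the article and are filled in here as assumptions, as is the reading that the 900 gigabytes per second NVLink figure counts both directions.

```python
# PCIe Gen 5 link parameters (assumptions, not taken from the article).
GT_PER_S_PER_LANE = 32            # gigatransfers per second per lane
ENCODING_EFFICIENCY = 128 / 130   # 128b/130b line coding overhead
LANES = 16                        # a full-width x16 link

NVLINK_GB_PER_S = 900             # from the article, assumed bidirectional


def pcie_gb_per_s(bidirectional: bool = True) -> float:
    """Usable PCIe Gen 5 x16 bandwidth in GB/s under the assumptions above."""
    per_lane = GT_PER_S_PER_LANE * ENCODING_EFFICIENCY / 8   # GB/s, one way
    one_way = per_lane * LANES
    return 2 * one_way if bidirectional else one_way


pcie_total = pcie_gb_per_s()
nvlink_vs_pcie = NVLINK_GB_PER_S / pcie_total

print(f"PCIe Gen 5 x16, both directions: about {pcie_total:.0f} GB/s")
print(f"NVLink advantage: about {nvlink_vs_pcie:.1f}x")
```

With those inputs the ratio comes out just above seven, which matches the claim; a narrower PCIe link or a one-directional comparison would give a different number.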

NVSwitch lets users link 32 Nvidia DGX H100 systems (a supercomputer in a box) into an AI supercomputer that delivers an exaflop of peak AI performance.

"That's going to allow multiple server nodes to talk to each other over NVLink with up to 256 GPUs," Salvator said.
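The system-level numbers quoted here fit together arithmetically, as the sketch below shows. It uses only the three figures from the article (32 systems, 256 GPUs, one exaflop) and derives the rest; the per-GPU result is an implied peak figure, not a measured or officially quoted one.

```python
SYSTEMS = 32            # DGX H100 systems, from the article
TOTAL_GPUS = 256        # GPU count quoted by Salvator
PEAK_AI_FLOPS = 1e18    # "an exaflop of peak AI performance", from the article

# GPUs per DGX H100 implied by the two counts above.
gpus_per_system = TOTAL_GPUS // SYSTEMS

# Peak AI throughput each GPU must contribute to reach one exaflop.
implied_petaflops_per_gpu = PEAK_AI_FLOPS / TOTAL_GPUS / 1e15

print(f"GPUs per system: {gpus_per_system}")
print(f"Implied peak per GPU: about {implied_petaflops_per_gpu:.1f} petaflops")
```

In other words, the exaflop claim rests on each of the 256 GPUs delivering roughly four petaflops of peak AI throughput at whatever precision Nvidia uses for that headline figure.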

Alexander Ishii and Ryan Wells, both veteran Nvidia engineers, will describe how the switch lets users build systems with up to 256 GPUs to tackle demanding workloads like training AI models that have more than a trillion parameters. The switch includes engines that speed data transfers using the Nvidia Scalable Hierarchical Aggregation and Reduction Protocol (SHARP). SHARP is an in-network computing capability that debuted on Nvidia Quantum InfiniBand networks. It can double data throughput on communications-intensive AI applications.
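One way to see where a doubling could come from is a simplified traffic model of an all-reduce, the collective operation that sums gradients across GPUs during training. The sketch below compares the bytes each GPU must send in a textbook ring all-reduce with the bytes it sends when the switch performs the summation. This is a toy model under stated assumptions (equal-sized messages, no latency or protocol overhead), not a description of how Nvidia implements SHARP, and the function names are hypothetical.

```python
def ring_allreduce_bytes_sent(num_gpus: int, message_bytes: float) -> float:
    """Bytes each GPU sends in a ring all-reduce (reduce-scatter + all-gather)."""
    return 2 * (num_gpus - 1) / num_gpus * message_bytes


def in_network_allreduce_bytes_sent(message_bytes: float) -> float:
    """Bytes each GPU sends when the switch sums the data: one copy, once."""
    return message_bytes


def traffic_ratio(num_gpus: int, message_bytes: float = 1.0) -> float:
    """How much more each GPU sends with the ring than with in-network reduction."""
    return (ring_allreduce_bytes_sent(num_gpus, message_bytes)
            / in_network_allreduce_bytes_sent(message_bytes))


for n in (2, 8, 32, 256):
    print(f"{n:>3} GPUs: ring sends {traffic_ratio(n):.2f}x the bytes per GPU")
```

In this model the ratio approaches two as the GPU count grows, which is consistent with the "can double" figure, though real gains depend on the workload and network.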

"The goal here with that is to deliver, you know, great improvements in cross socket performance. In other words, get bottlenecks out of the way," Salvator said.

Jack Choquette, a senior distinguished engineer with 14 years at the company, will provide a detailed tour of the Nvidia H100 Tensor Core GPU, aka Hopper. In addition to using the new interconnects to scale to new heights, it packs features that boost the accelerator’s performance, efficiency and security.

Hopper’s new Transformer Engine and upgraded Tensor Cores deliver a 30-times speedup compared to the prior generation on AI inference with the world’s largest neural network models. And it employs the world’s first HBM3 memory system to deliver a whopping 3 terabytes per second of memory bandwidth, Nvidia’s biggest generational increase ever.

Among other new features, Hopper adds virtualization support for multi-tenant, multi-user configurations. New DPX instructions speed recurring loops for select mapping, DNA and protein-analysis applications. And Hopper packs support for enhanced security with confidential computing.
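The "recurring loops" that DPX targets are those of dynamic programming, where each table cell is computed from its neighbors with a handful of add and max operations. A minimal sketch of one such algorithm, Smith-Waterman local alignment as used in DNA and protein analysis, is shown below in plain Python. It does not use DPX or any GPU code, and the scoring values are arbitrary illustrative choices; it is only meant to show the shape of the inner loop that such instructions would speed up.

```python
def smith_waterman_score(a: str, b: str, match: int = 2,
                         mismatch: int = -1, gap: int = -2) -> int:
    """Best local-alignment score between two sequences (linear gap penalty)."""
    prev = [0] * (len(b) + 1)     # previous row of the scoring table
    best = 0
    for i in range(1, len(a) + 1):
        curr = [0] * (len(b) + 1)
        for j in range(1, len(b) + 1):
            # Score for aligning a[i-1] with b[j-1], extending the diagonal.
            diag = prev[j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            # The add-then-max recurrence at the heart of the algorithm.
            curr[j] = max(0, diag, prev[j] + gap, curr[j - 1] + gap)
            best = max(best, curr[j])
        prev = curr
    return best


print(smith_waterman_score("GATTACA", "TTAC"))
```

Every cell repeats the same add-and-compare pattern, so hardware that fuses those steps can shorten the whole table fill.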

Choquette, one of the lead chip designers on the Nintendo 64 console early in his career, will also describe parallel computing techniques underlying some of Hopper’s advances.

Michael Ditty, an architecture manager with a 17-year tenure at the company, will provide new performance specs for Nvidia Jetson AGX Orin, an engine for edge AI, robotics and advanced autonomous machines.

It integrates 12 Arm Cortex-A78 cores and an Nvidia Ampere architecture GPU to deliver up to 275 trillion operations per second on AI inference jobs. That’s up to eight times greater performance at 2.3 times higher energy efficiency than the prior generation.
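Taken together, the two multipliers also say something about power. The sketch below, which treats the "up to" figures as if they applied at the same operating point (an assumption the article does not confirm), backs out the prior generation's implied throughput and the implied increase in power draw.

```python
ORIN_TOPS = 275            # trillion operations per second, from the article
PERFORMANCE_GAIN = 8       # "up to eight times", from the article
EFFICIENCY_GAIN = 2.3      # "2.3 times higher energy efficiency", from the article

# Prior-generation throughput implied by the performance multiplier.
implied_prior_tops = ORIN_TOPS / PERFORMANCE_GAIN

# Efficiency is performance per watt, so power scales as performance / efficiency.
implied_power_ratio = PERFORMANCE_GAIN / EFFICIENCY_GAIN

print(f"Implied prior generation: about {implied_prior_tops:.0f} TOPS")
print(f"Implied power increase: about {implied_power_ratio:.1f}x")
```

Read this way, the eightfold speedup comes partly from better efficiency and partly from a larger power budget at the top of the range.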

The latest production module packs up to 32 gigabytes of memory and is part of a compatible family that scales down to pocket-sized 5W Jetson Nano developer kits.

Software stack

All the new chips support the Nvidia software stack that accelerates more than 700 applications and is used by 2.5 million developers. Based on the CUDA programming model, it includes dozens of Nvidia software development kits (SDKs) for vertical markets like automotive (Drive) and healthcare (Clara), as well as technologies such as recommendation systems (Merlin) and conversational AI (Riva).

The Nvidia Grace CPU Superchip is built to provide software developers with a standards-compliant platform. Arm provides a set of specifications as part of its SystemReady initiative, which aims to bring standardization to the Arm ecosystem.

Grace CPU targets the Arm system standards to offer compatibility with off-the-shelf operating systems and software applications, and Grace CPU will take advantage of the Nvidia Arm software stack from the start.

The Nvidia AI platform is available from every major cloud service and system maker. Nvidia is working with leading HPC, supercomputing, hyperscale, and cloud customers for the Grace CPU Superchip. Grace CPU Superchip and Grace Hopper Superchip are expected to be available in the first half of 2023.

"With the data center architecture, these fabrics are designed to alleviate bottlenecks to really make sure that GPUs and CPUs can function together as peer processors," Salvator said.
