Nvidia and Intel show machine learning performance gains on latest MLPerf Training 2.1 results

MLCommons is out today with its latest set of machine learning (ML) MLPerf benchmarks, once again showing how hardware and software for artificial intelligence (AI) are getting faster.

MLCommons is a vendor-neutral organization that aims to provide standardized testing and benchmarks to help evaluate the state of ML software and hardware. Under the MLPerf testing name, MLCommons collects different ML benchmarks multiple times throughout the year. In September, the MLPerf Inference results were released, showing how different technologies have improved inference performance.

Today, the new MLPerf benchmarks being reported include the Training 2.1 benchmark, which is for ML training; HPC 2.0 for large systems including supercomputers; and Tiny 1.0 for small and embedded deployments.

"The key reason why we're doing benchmarking is to drive transparency and measure performance," David Kanter, executive director of MLCommons, said during a press briefing. "This is all predicated on the key notion that once you can actually measure something, you can start thinking about how you would improve it."

How the MLPerf training benchmark works

Looking at the training benchmark in particular, Kanter said that MLPerf isn't just about hardware; it's about software too.

In ML systems, models need to first be trained on data in order to operate. The training process benefits from accelerator hardware, as well as optimized software.

Kanter explained that the MLPerf Training benchmark starts with a predetermined dataset and a model. Organizations then train the model to hit a target quality threshold. One of the primary metrics that the MLPerf Training benchmark captures is time to train.
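The procedure Kanter describes (fixed dataset, fixed model, train until a quality target is met, report the elapsed time) can be illustrated with a minimal sketch. This is not MLPerf's reference harness: the function `time_to_train`, the toy one-parameter model, the quality score and the 0.99 target are all hypothetical stand-ins chosen only to show the shape of the measurement.

```python
import time


def time_to_train(train_one_epoch, evaluate, target_quality, max_epochs=1000):
    """Train until evaluate() reaches target_quality.

    Returns (elapsed_seconds, epochs_run, reached_target).
    Hypothetical helper, not part of any MLPerf tooling.
    """
    start = time.perf_counter()
    for epoch in range(1, max_epochs + 1):
        train_one_epoch()
        if evaluate() >= target_quality:
            return time.perf_counter() - start, epoch, True
    return time.perf_counter() - start, max_epochs, False


# Toy stand-in for a predetermined dataset and model:
# fit y = w * x to points that lie on the line y = 3x.
DATA = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0, 4.0)]
state = {"w": 0.0}


def train_one_epoch(lr=0.01):
    # One pass of stochastic gradient descent on squared error.
    for x, y in DATA:
        grad = 2.0 * (state["w"] * x - y) * x
        state["w"] -= lr * grad


def evaluate():
    # Map mean squared error to a score in (0, 1]; higher is better.
    mse = sum((state["w"] * x - y) ** 2 for x, y in DATA) / len(DATA)
    return 1.0 / (1.0 + mse)


seconds, epochs, reached = time_to_train(
    train_one_epoch, evaluate, target_quality=0.99
)
print(f"reached={reached} epochs={epochs} time_to_train={seconds:.6f}s")
```

Because the dataset, model and quality target are held fixed, any change in the reported time comes from the hardware or software doing the training, which is why the same metric can reflect both kinds of improvement.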

"When you look at the results, and this goes for any submission — whether it's training, tiny, HPC or inference — all of the results are submitted to say something," Kanter said. "Part of this exercise is figuring out what that something they say is."

The metrics can identify relative levels of performance and also serve to highlight improvement over time for both hardware and software.

John Tran, senior director of deep learning libraries and hardware architecture at Nvidia and chair of MLPerf Training at MLCommons, highlighted the fact that there were a number of software-only submissions for the latest benchmark.

"I find it continually interesting how we have so many software-only submissions and they don't necessarily need help from the hardware vendors," Tran said. "I think that's great and is showing the maturity of the benchmark and usefulness to people."

Intel and Habana Labs push training forward with Gaudi2

The importance of software was also highlighted by Jordan Plawner, senior director of AI products at Intel. During the MLCommons press call, Plawner explained what he sees as the difference between ML inference and training workloads in terms of hardware and software.

"Training is a distributed-workload problem," Plawner said. "Training is more than just hardware, more than just the silicon; it's the software, it's also the network and running distributed-class workloads."

In contrast, Plawner said that ML inference can be a single-node issue that doesn't have the same distributed aspects, which provides a lower barrier to entry for vendor technologies than ML training.

In terms of results, Intel is well represented in the latest MLPerf Training benchmarks with its Gaudi2 technology. Intel acquired Habana Labs and its Gaudi technology for $2 billion in 2019 and has since helped to advance the company's capabilities.

The most advanced silicon from Habana Labs is now the Gaudi2 system, which was announced in May. The latest Gaudi2 results show gains over the first set of benchmarks that Habana Labs reported with the MLPerf Training update in June. According to Intel, Gaudi2 improved by 10% for time-to-train in TensorFlow for both BERT and ResNet-50 models.

Nvidia H100 hops past predecessor

Nvidia is also reporting strong gains for its technologies in the latest MLPerf Training benchmarks.

Testing results for Nvidia's Hopper-based H100 with MLPerf Training show significant gains over the prior-generation A100-based hardware. In an Nvidia briefing call discussing the MLCommons results, Dave Salvator, director of AI, benchmarking and cloud at Nvidia, said that the H100 provides 6.7 times more performance than the first A100 submission delivered on the same benchmarks several years ago. Salvator said that a key part of what makes the H100 perform so well is the integrated transformer engine that is part of the Nvidia Hopper chip architecture.

While H100 is now Nvidia's leading hardware for ML training, that's not to say the A100 hasn't improved its MLPerf Training results as well.

"The A100 continues to be a really compelling product for training, and over the last couple of years we've been able to scale its performance by more than two times from software optimizations alone," Salvator said.

Overall, whether it's with new hardware or continued software optimizations, Salvator expects there will be a steady stream of performance improvements for ML training in the months and years to come.

"AI's appetite for performance is unbounded, and we continue to need more and more performance to be able to work with growing datasets in a reasonable amount of time," Salvator said.

The need to be able to train a model faster is critical for a number of reasons, including the fact that training is an iterative process. Data scientists often need to train and then retrain models in order to get the desired results.

"That ability to train faster makes all the difference in not only being able to work with larger networks, but being able to employ them faster and get them doing work for you in generating value," Salvator said.
