MLPerf 3.1 adds large language model benchmarks for inference

MLCommons is growing its suite of MLPerf AI benchmarks with the addition of testing for large language models (LLMs) for inference and a new benchmark that measures performance of storage systems for machine learning (ML) workloads.

MLCommons is a vendor-neutral, multi-stakeholder organization that aims to provide a level playing field for vendors to report on different aspects of AI performance with the MLPerf set of benchmarks. The new MLPerf Inference 3.1 benchmarks released today are the second major update of the results this year, following the 3.0 results that came out in April. The MLPerf Inference 3.1 round includes more than 13,500 performance results.

Submitters include: ASUSTeK, Azure, cTuning, Connect Tech, Dell, Fujitsu, Giga Computing, Google, H3C, HPE, IEI, Intel, Intel-Habana-Labs, Krai, Lenovo, Moffett, Neural Magic, Nvidia, Nutanix, Oracle, Qualcomm, Quanta Cloud Technology, SiMA, Supermicro, TTA and xFusion.

Continued performance improvement

A common theme across MLPerf benchmarks with each update is the continued improvement in performance for vendors — and the MLPerf 3.1 Inference results follow that pattern. While there are multiple types of testing and configurations for the inference benchmarks, MLCommons founder and executive director David Kanter said in a press briefing that many submitters improved their performance by 20% or more over the 3.0 benchmark.

Beyond the performance gains, MLPerf is also expanding its scope with the 3.1 inference benchmarks.

"We're evolving the benchmark suite to reflect what's going on," Kanter said. "Our LLM benchmark is brand new this quarter and really reflects the explosion of generative AI large language models."

What the new MLPerf Inference 3.1 LLM benchmarks are all about

This isn't the first time MLCommons has attempted to benchmark LLM performance.

Back in June, the MLPerf 3.0 Training benchmarks added LLMs for the first time. Training LLMs, however, is a very different task from running inference.

"One of the critical differences is that for inference, the LLM is fundamentally performing a generative task as it's writing multiple sentences," Kanter said.

The MLPerf Inference benchmark for LLMs uses the GPT-J 6B (billion) parameter model to perform text summarization on the CNN/Daily Mail dataset. Kanter emphasized that while the MLPerf Training benchmark focuses on very large foundation models, the task MLPerf performs with the inference benchmark is representative of a wider set of use cases that more organizations can deploy.

"Many folks simply don't have the compute or the data to support a really large model," said Kanter. "The actual task we're performing with our inference benchmark is text summarization."

Inference isn't just about GPUs — at least according to Intel

While high-end GPU accelerators are often at the top of the MLPerf listing for training and inference, the big numbers are not what all organizations are looking for — at least according to Intel.

Intel silicon is well represented in the MLPerf Inference 3.1 results, with submissions for Habana Gaudi accelerators, 4th Gen Intel Xeon Scalable processors and Intel Xeon CPU Max Series processors. According to Intel, the 4th Gen Intel Xeon Scalable performed well on the GPT-J news summarization task, summarizing one paragraph per second in real-time server mode.

In response to a question from VentureBeat during the Q&A portion of the MLCommons press briefing, Intel's senior director of AI products Jordan Plawner commented that there is diversity in what organizations need for inference.

"At the end of the day, enterprises, businesses and organizations need to deploy AI in production and that clearly needs to be done in all kinds of compute," said Plawner. "To have so many representatives of both software and hardware showing that it [inference] can be run in all kinds of compute is really a leading indicator of where the market goes next, which is now scaling out AI models, not just building them."

Nvidia claims Grace Hopper MLPerf Inference gains, with more to come

While Intel is keen to show how CPUs are valuable for inference, GPUs from Nvidia are well represented in the MLPerf Inference 3.1 benchmarks.

The MLPerf Inference 3.1 benchmarks mark the first time Nvidia's GH200 Grace Hopper Superchip has been included. The Grace Hopper Superchip pairs an Nvidia CPU with a GPU to optimize AI workloads.

"Grace Hopper made a very strong first showing, delivering up to 17% more performance versus our H100 GPU submissions, which were already delivering across-the-board leadership," Dave Salvator, director of AI at Nvidia, said during a press briefing.

The Grace Hopper is intended for the largest and most demanding workloads, but that's not all that Nvidia is going after. The Nvidia L4 GPUs were also highlighted by Salvator for their MLPerf Inference 3.1 results.

"L4 also had a very strong showing up to 6x more performance versus the best x86 CPUs submitted this round," he said.
