MLCommons releases new benchmarks to boost ML performance

Understanding the performance characteristics of different hardware and software for machine learning (ML) is critical for organizations that want to optimize their deployments.

One of the ways to understand the capabilities of hardware and software for ML is by using benchmarks from MLCommons — a multi-stakeholder organization that builds out different performance benchmarks to help advance the state of ML technology.

The MLCommons MLPerf testing regimen covers several areas, with benchmarks conducted throughout the year. In early July, MLCommons released benchmarks for ML training, and today it is releasing its latest set of MLPerf benchmarks for ML inference. In training, a model learns from data, while in inference a model "infers," or produces a result from, new data, as when a computer vision model performs image recognition.

The benchmarks come from the MLPerf Inference v2.1 update, which introduces new models, including SSD-ResNeXt50 for computer vision, and a new testing division for inference over the network to help expand the testing suite to better replicate real-world scenarios.

"MLCommons is a global community and our interest really is to enable ML for everyone," Vijay Janapa Reddi, vice president of MLCommons, said during a press briefing. "What this means is actually bringing together all the hardware and software players in the ecosystem around machine learning so we can try and speak the same language."

He added that speaking the same language is all about having standardized ways of claiming and reporting ML performance metrics.

How MLPerf measures ML inference benchmarks

Reddi emphasized that benchmarking is a challenging activity in ML inference, as there are any number of different variables that are constantly changing. He noted that MLCommons' goal is to measure performance in a standardized way to help track progress.

Inference spans many areas that are considered in the MLPerf 2.1 suite, including recommendations, speech recognition, image classification and object detection. Reddi explained that MLCommons pulls in public data, then has a trained ML network model for which the code is available. The group then determines a target quality score that submitters of different hardware platforms need to meet.
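The qualification step Reddi describes can be illustrated with a minimal Python sketch. It is not MLCommons code: the Submission fields, the system names, the 0.75 quality target and the sample figures are all hypothetical, chosen only to show the idea that a performance number counts only if the system first meets the target quality score.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Submission:
    """Hypothetical record of one benchmark run (not an MLPerf data format)."""
    system: str       # made-up system name
    accuracy: float   # quality measured on the public dataset
    samples: int      # number of inference queries processed
    seconds: float    # wall-clock time for the run


def valid_throughput(sub: Submission, target_quality: float) -> Optional[float]:
    """Return samples per second if the quality target is met, else None."""
    if sub.accuracy < target_quality:
        return None  # result does not qualify, however fast the system is
    return sub.samples / sub.seconds


TARGET_QUALITY = 0.75  # illustrative value, not an actual MLPerf target

submissions = [
    Submission("system-a", accuracy=0.76, samples=10_000, seconds=4.0),
    Submission("system-b", accuracy=0.70, samples=10_000, seconds=2.0),
]

results = {s.system: valid_throughput(s, TARGET_QUALITY) for s in submissions}

for name, throughput in results.items():
    status = f"{throughput:.1f} samples/s" if throughput is not None else "below quality target"
    print(f"{name}: {status}")
```

In this made-up example the second system is faster but misses the quality bar, so its result is excluded. That is the kind of apples-to-apples comparison a shared target is meant to enforce.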

"Ultimately, our goal here is to make sure that things get improved, so if we can measure them, we can improve them," he said.

Results? MLPerf Inference has thousands

The MLPerf Inference 2.1 benchmark suite is not a listing for the faint of heart, or for those who are afraid of numbers. There are lots and lots of them.

In total, the new benchmark generated over 5,300 results, provided by a long list of submitters including Alibaba, Asustek, Azure, Biren, Dell, Fujitsu, Gigabyte, H3C, HPE, Inspur, Intel, Krai, Lenovo, Moffett, Nettrix, NeuralMagic, Nvidia, OctoML, Qualcomm, Sapeon and Supermicro.

"It's very exciting to see that we've got over 5,300 performance results, in addition to over 2,400 power measurement results," Reddi said. "So there's a wealth of data to look at."

The volume of data is overwhelming and includes systems that are just coming to market. For example, among Nvidia's many submissions are several for the company's next-generation H100 accelerator, which was first announced back in March.

"The H100 is delivering phenomenal speedups versus previous generations and versus other competitors," Dave Salvator, director of product marketing at Nvidia, commented during a press briefing that Nvidia hosted.

While Salvator is confident in Nvidia's performance, he noted that it's also good to see new competitors show up in the latest MLPerf Inference 2.1 benchmarks. Among those new competitors is Chinese artificial intelligence (AI) accelerator vendor Biren Technology. Salvator said Biren's new accelerator made a "decent" first showing in the MLPerf Inference benchmarks.

"With that said, you can see the H100 outperform them (Biren) handily and the H100 will be in market here very soon before the end of this year," Salvator said.

Forget about AI hype, enterprises should focus on what matters to them

The MLPerf Inference numbers, while verbose and potentially overwhelming, also have a real meaning that can help to cut through AI hype, according to Jordan Plawner, senior director of Intel AI products.

"I think we probably can all agree there's been a lot of hype in AI," Plawner commented during the MLCommons press briefing. "I think my experience is that customers are very wary of PowerPoint in claims or claims based on one model."

Plawner noted that some models are great for certain use cases, but not all use cases. He said that MLPerf helps him and Intel communicate to customers in a credible way with a common framework that looks at multiple models. He acknowledged that translating real-world problems into benchmarks is an imperfect exercise, but said MLPerf has a lot of value.

"This is the industry's best effort to say here [is] an objective set of measures to at least say — is company XYZ credible," Plawner said.
