Nvidia benchmark tests show impressive gains in training AI models

Nvidia announced that systems based on its graphics processing units (GPUs) are delivering 3 to 5 times better performance in training AI models than they did a year ago, according to the latest MLPerf benchmarks published yesterday.

The MLPerf benchmark is maintained by the MLCommons Association, a consortium backed by Alibaba, Facebook AI, Google, Intel, Nvidia, and others that acts as an independent steward.

The latest set of benchmarks spans eight different workloads covering a range of use cases for AI model training, including speech recognition, natural language processing, object detection, and reinforcement learning. Nvidia claims its OEM partners were the only systems vendors to run all the workloads defined by the MLPerf benchmark across a total of 4,096 GPUs. Dell, Fujitsu, Gigabyte Technology, Inspur, Lenovo, Nettrix, and Supermicro all provided on-premises systems certified by Nvidia that were used to run the benchmark.

Nvidia claims that overall it improved more than any of its rivals, delivering as much as 2.1 times more performance than the last time the MLPerf benchmarks were run. Those benchmarks provide a reliable point of comparison that data scientists and IT organizations can use to make an apples-to-apples comparison between systems, said Paresh Kharya, senior director for product management for Nvidia. "MLPerf is an industry-standard benchmark," he said.

Trying to quantify the unknown

It's not clear to what degree IT organizations rely on consortium benchmarks to decide what class of system to acquire. The workloads an IT team deploys differ from the benchmark workloads, so benchmark results are no guarantee of actual performance. Arguably, the most compelling thing about the latest benchmark results is that they show systems acquired last year or even earlier continue to improve in overall performance as software updates are made. That increased level of performance could reduce the pace at which Nvidia-based systems need to be replaced.

Of course, the number of organizations investing in on-premises IT platforms to run AI workloads is unknown. Some certainly prefer to train AI models in on-premises IT environments for a variety of security, compliance, and cloud networking reasons. However, the cost of acquiring a GPU-based server tends to make consuming GPUs on demand via a cloud service the more attractive option for training AI models until the organization reaches a certain threshold in the number of models being trained simultaneously.
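
The threshold described above can be thought of as a break-even point between renting and buying. The following is a minimal sketch of that arithmetic, not a pricing model from Nvidia or any cloud provider: the function name, its parameters, and every dollar figure are hypothetical placeholders chosen only to illustrate the calculation.

```python
def break_even_gpu_hours(server_cost, on_prem_hourly_cost, cloud_hourly_rate):
    """Return the GPU-hours of use after which buying a server costs less
    than renting the same capacity, or None if renting is never more expensive.

    All inputs are hypothetical:
      server_cost         -- up-front purchase price of the on-premises server
      on_prem_hourly_cost -- power, cooling, and staffing per hour of use
      cloud_hourly_rate   -- on-demand price per hour for equivalent capacity
    """
    saving_per_hour = cloud_hourly_rate - on_prem_hourly_cost
    if saving_per_hour <= 0:
        # Running on-premises costs at least as much per hour as renting,
        # so the purchase price is never recovered.
        return None
    return server_cost / saving_per_hour


# Placeholder figures, not real prices.
hours = break_even_gpu_hours(
    server_cost=100_000, on_prem_hourly_cost=2, cloud_hourly_rate=12
)
print(f"Break-even after {hours:,.0f} hours of use")
```

The more models an organization trains at once, the faster it accumulates those hours, which is why the purchase only starts to pay off past a certain level of sustained use.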

Alternatively, providers of on-premises platforms are increasingly offering pricing plans that enable organizations to consume on-premises IT infrastructure using the same model as a cloud service provider.

Other classes of processors might end up being employed to train an AI model. Right now, however, GPUs -- thanks to their inherent parallelization capabilities -- have proven themselves to be the most efficient option.

Regardless of the platform employed, the number of AI models being trained continues to steadily increase. There is no shortage of use cases involving applications that could be augmented using AI. The challenge in many organizations now is prioritizing AI projects given the cost of GPU-based platforms. Of course, as consumption of GPUs increases, the cost of manufacturing them will eventually decline.

As organizations create their road maps for AI, they should be able to safely assume that both the amount of time required and the total cost of training an AI model will continue to decline in the years ahead -- even allowing for the occasional processor shortage brought on by unpredictable "black swan" events such as the COVID-19 pandemic.
