Nvidia releases TensorRT 8 for faster AI inference

Nvidia today announced the release of TensorRT 8, the latest version of its software development kit (SDK) designed for AI and machine learning inference. The SDK is built for deploying AI models that power search engines, ad recommendations, chatbots, and more, and Nvidia claims that TensorRT 8 cuts inference time in half for language queries compared with the previous release.

Models are growing increasingly complex, and demand is on the rise for real-time deep learning applications. According to a recent O'Reilly survey, 86.7% of organizations are now considering, evaluating, or putting AI products into production. And Deloitte reports that 53% of enterprises adopting AI spent more than $20 million in 2019 and 2020 on technology and talent.

TensorRT essentially tunes a model's numerical representation to balance the smallest model size against the highest accuracy for the system it will run on. Nvidia claims that TensorRT-based apps perform up to 40 times faster than CPU-only platforms during inference, and that TensorRT 8-specific optimizations allow BERT-Large, one of the most popular Transformer-based models, to run in 1.2 milliseconds.
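
To make the size-versus-accuracy balance concrete, here is a minimal sketch of symmetric 8-bit quantization in plain Python. It is an illustration only, not TensorRT code: the function names and sample weights are hypothetical, and the single shared scale is an assumed simplification of how inference toolkits typically store values at lower precision.

```python
def quantize_int8(values):
    """Map floats to integers in [-127, 127] using one shared scale (assumed scheme)."""
    max_abs = max(abs(v) for v in values)
    scale = max_abs / 127 if max_abs > 0 else 1.0
    # Round to the nearest integer step and clamp to the signed 8-bit range.
    quantized = [max(-127, min(127, round(v / scale))) for v in values]
    return quantized, scale


def dequantize(quantized, scale):
    """Recover approximate floats from the 8-bit integers."""
    return [q * scale for q in quantized]


# Hypothetical weights, chosen only for illustration.
weights = [0.52, -1.27, 0.003, 0.9, -0.41]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)

# The worst-case rounding error is at most half of one quantization step.
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(q, max_error)
```

Each value now fits in one byte instead of the four bytes of a 32-bit float, at the cost of a rounding error no larger than half a quantization step.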

Sparsity, a performance technique supported by Nvidia's Ampere architecture GPUs, among others, increases efficiency in TensorRT 8 by reducing the number of computational operations a network performs. Meanwhile, quantization-aware training lets developers run inference on trained models at reduced precision, such as INT8, without sacrificing much accuracy.
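
As a rough illustration of how sparsity reduces computation, the sketch below prunes weights in a 2:4 pattern, the structured sparsity scheme Nvidia documents for Ampere GPUs: in every group of four weights, the two with the smallest magnitude are set to zero. The article does not describe the pattern itself, so treat this as an assumption; the function name and sample values are hypothetical, and this is not TensorRT code.

```python
def prune_2_of_4(row):
    """Zero the two smallest-magnitude values in each group of four weights."""
    if len(row) % 4 != 0:
        raise ValueError("row length must be a multiple of 4")
    pruned = []
    for start in range(0, len(row), 4):
        group = row[start:start + 4]
        # Indices of the two largest-magnitude entries in this group.
        keep = sorted(range(4), key=lambda i: abs(group[i]), reverse=True)[:2]
        pruned.extend(v if i in keep else 0.0 for i, v in enumerate(group))
    return pruned


# Hypothetical weights, chosen only for illustration.
row = [0.9, -0.1, 0.05, -0.7, 0.2, 0.3, -0.25, 0.01]
sparse_row = prune_2_of_4(row)
print(sparse_row)
```

Half of the weights in each group become zero, so hardware that understands the pattern can skip those multiplications. In practice a pruned model is usually fine-tuned afterward to recover accuracy.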

"[It's] imperative for enterprises to deploy state-of-the-art inferencing solutions," Nvidia VP of developer programs Greg Estes said in a press release. "The latest version of TensorRT introduces new capabilities that enable companies to deliver conversational AI applications to their customers with a level of quality and responsiveness that was never before possible."

TensorRT momentum

Nvidia claims that in the five years since its initial release, TensorRT has been downloaded nearly 2.5 million times and used by more than 350,000 developers across 27,500 companies in domains including health care, automotive, finance, and retail. Hugging Face worked with Nvidia to launch AI text analysis, neural search, and conversational AI services, while GE Healthcare tapped the SDK to bolster its computer vision systems for ultrasounds, improving the performance of its cardiac view detection algorithm.

"We're closely collaborating with Nvidia to deliver the best possible performance for state-of-the-art models on Nvidia GPUs," Hugging Face product director Jeff Boudier said in a statement. "With TensorRT 8, Hugging Face achieved 1-millisecond inference latency on BERT, and we're excited to offer this performance to our customers later this year."

TensorRT 8 is now generally available to members of the Nvidia Developer program. The latest versions of plug-ins, parsers, and samples are also available as open source from the TensorRT GitHub repository.
