Nvidia's TensorRT 7 improves the compiler for conversational AI models

A partnership with Didi Chuxing and new autonomous driving solutions weren't the only things Nvidia announced at its GPU Technology Conference in Suzhou today. The chip firm took the opportunity to introduce TensorRT 7, the newest release of its platform for high-performance deep learning inference on graphics cards, which ships with an improved compiler optimized for real-time inferencing workloads.

TensorRT 7 will be available in the coming days from the TensorRT webpage without charge to members of Nvidia's Developer program, and the latest versions of plugins, parsers, and samples are live on the TensorRT GitHub repository. The platform, which ships alongside the CUDA-X AI libraries as part of Nvidia's inference suite, can validate and deploy a trained neural network for inference on a range of GPU-equipped hardware, from datacenter servers to embedded devices. The company notes that some of the world's largest brands, including Alibaba, American Express, Baidu, Pinterest, Snap, Tencent, and Twitter, are using TensorRT for tasks like image classification, fraud detection, segmentation, and object detection.

"We have entered a new chapter in AI, where machines are capable of understanding human language in real time," said Nvidia founder and CEO Jensen Huang during a keynote address, citing a Juniper Research study predicting there will be 8 billion devices with digital assistants in use by 2023, up from 3.25 billion today. "TensorRT 7 helps make this possible, providing developers everywhere with the tools to build and deploy faster, smarter conversational AI services that allow more natural human-to-AI interaction."

The aforementioned compiler automatically accelerates the recurrent and Transformer-based machine learning models required for sophisticated speech applications, according to Huang. Transformers are a type of neural network architecture introduced by researchers at Google Brain, Google's AI research division. Like other neural networks, they contain functions (neurons) arranged in layers that transmit signals from input data and adjust the strength (weights) of each connection, which is how such models extract features and learn to make predictions. What sets Transformers apart is attention: every output element is connected to every input element, and the weightings between them are calculated dynamically.
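
For readers who want to see what "calculated dynamically" means in practice, here is a minimal sketch of scaled dot-product attention, the standard formulation of the mechanism, in plain Python. It is an illustration only and has nothing to do with TensorRT's internals: the function names and toy inputs are hypothetical, and the learned projection matrices a real Transformer applies to its inputs are left out for brevity.

```python
import math

def softmax(scores):
    """Turn raw scores into positive weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))

def attention(queries, keys, values):
    """Scaled dot-product attention over lists of equal-length vectors.

    Hypothetical helper for illustration; real models first multiply the
    inputs by learned projection matrices, which are omitted here.
    """
    scale = math.sqrt(len(keys[0]))
    outputs, all_weights = [], []
    for q in queries:
        # One weight per input element, computed from the data itself.
        weights = softmax([dot(q, k) / scale for k in keys])
        # Each output is a weighted mix of every input's value vector.
        mixed = [sum(w * v[i] for w, v in zip(weights, values))
                 for i in range(len(values[0]))]
        outputs.append(mixed)
        all_weights.append(weights)
    return outputs, all_weights

# Toy self-attention: three input elements, each a 2-dimensional vector.
inputs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
outputs, weights = attention(inputs, inputs, inputs)
```

Each row of `weights` holds one output element's connection strengths to all three inputs. Because those strengths are recomputed from whatever data arrives, rather than stored as fixed parameters, the workload grows with the square of the sequence length, which is part of why such models are demanding to serve in real time.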

TensorRT 7 ostensibly speeds up both Transformer and recurrent network components -- including popular networks like DeepMind's WaveRNN and Google's Tacotron 2 and BERT -- by more than 10 times compared with CPU-based approaches, while driving latency below the 300-millisecond threshold considered necessary for real-time interactions. That's partly thanks to optimizations targeting recurrent loop structures, which are used to make predictions on sequential data like text and voice recordings.
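
To make "recurrent loop structure" concrete, the sketch below runs a single-unit recurrent network over a short sequence in plain Python. It is a toy under stated assumptions, not Nvidia's implementation: the names `rnn_step` and `run_rnn` and the weight values are hypothetical, and production models use weight matrices and far larger state vectors.

```python
import math

def rnn_step(x, h, w_in, w_rec):
    """One recurrent update: mix the current input with the previous state."""
    return math.tanh(w_in * x + w_rec * h)

def run_rnn(sequence, w_in=0.5, w_rec=0.9):
    """Feed a sequence through the loop one step at a time.

    The weights are arbitrary illustrative values, not trained parameters.
    """
    h = 0.0  # hidden state carried from one time step to the next
    states = []
    for x in sequence:
        h = rnn_step(x, h, w_in, w_rec)
        states.append(h)
    return states

# A single nonzero input followed by silence.
states = run_rnn([1.0, 0.0, 0.0])
```

The hidden state stays nonzero after the input drops to zero, which is how the loop carries context forward through time. Each step also depends on the one before it, so the loop cannot simply be run in parallel across time steps, and that dependency is what makes this structure a natural target for compiler-level optimization.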
