Nvidia's Riva Custom Voice lets companies create custom voices powered by AI

At its fall 2021 GPU Technology Conference (GTC), Nvidia unveiled Riva Custom Voice, a new toolkit that the company claims can enable customers to create custom, "human-like" voices with only 30 minutes of speech recording data. According to Nvidia, businesses can use Riva Custom Voice to develop a virtual assistant with a unique voice, while call centers and developers can leverage it to launch brand voices and apps to support people with speech and language disabilities.

The actors behind brand voices like Progressive's Flo are often tasked with recording phone trees and e-learning scripts for corporate training video series. For companies, the costs can add up -- one source pegs the average hourly rate for voice actors at $39.63, plus additional fees for interactive voice response (IVR) prompts. Voice synthesis could boost actors' productivity by cutting down on the need for additional recordings, potentially freeing the actors up to pursue more creative work -- and saving businesses money in the process.

For example, Progressive used AI to create a Facebook Messenger chatbot with the voice of Stephanie Courtney, who plays Flo. KFC in Canada built a voice in a Southern U.S. English accent for the chain's ambassador, Colonel Sanders, in the company's Amazon Alexa app. Duolingo is employing AI to create voices for characters in its language learning apps. And National Australia Bank has deployed an AI-powered Australian English voice for the customers who call into its contact centers.

"Human-like interactions have long been one of the greatest challenges of AI, especially for companies with industry-specific jargon," Nvidia VP of product management for AI Kari Briski said in a blog post. "Now these companies can use speech AI to listen and respond to customers with an expressive voice that’s unique to their brand and that drives more engaging and delightful interactions."

Voice synthesis

Riva Custom Voice, which is available in the latest version of Nvidia's Riva conversational AI software development kit, leverages semi-supervised learning to create synthetic, bespoke voices for software, IVRs, and other business applications. In semi-supervised learning -- one of several types of AI training techniques -- machine learning algorithms determine the correlations between data points and then use a small amount of labeled data to mark those points. The system is then trained based on the newly applied data labels, eliminating the need to manually label all data.

Semi-supervised learning is applicable to a range of real-world problems where too little labeled data is available for supervised learning algorithms to work well. (Supervised learning requires that all data be labeled in order to complete the training process.) For example, it can alleviate the data prep burden in speech analysis, where labeling audio files is typically very labor-intensive.
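To make the idea concrete, here is a minimal Python sketch of one common semi-supervised approach, self-training with pseudo-labels. It is an illustration only, not Nvidia's method: the function name `self_train`, the nearest-centroid rule, the `margin` threshold, and the one-dimensional toy data are all assumptions chosen to keep the example short, and it expects at least two classes in the labeled set. A handful of labeled points seed the model, confident predictions on unlabeled points are adopted as new labels, and the model is refit on the enlarged set.

```python
from statistics import mean


def self_train(labeled, unlabeled, rounds=5, margin=1.0):
    """Toy self-training loop (hypothetical helper, not a Riva API).

    labeled:   list of (value, label) pairs, with at least two classes
    unlabeled: list of values with no label
    Returns (enlarged labeled list, values left unlabeled).
    """
    labeled = list(labeled)
    remaining = list(unlabeled)
    for _ in range(rounds):
        # Fit: one centroid per class from the points labeled so far.
        classes = sorted({label for _, label in labeled})
        centroids = {
            c: mean([v for v, lab in labeled if lab == c]) for c in classes
        }
        confident, uncertain = [], []
        for x in remaining:
            # Rank classes by distance from x to each class centroid.
            ranked = sorted(classes, key=lambda c: abs(x - centroids[c]))
            best, runner_up = ranked[0], ranked[1]
            gap = abs(x - centroids[runner_up]) - abs(x - centroids[best])
            # Adopt a pseudo-label only when the best class wins clearly.
            if gap >= margin:
                confident.append((x, best))
            else:
                uncertain.append(x)
        if not confident:
            break  # nothing new to learn from; stop early
        labeled.extend(confident)
        remaining = uncertain
    return labeled, remaining


if __name__ == "__main__":
    seed = [(0.0, "a"), (10.0, "b")]       # the small labeled set
    pool = [1.0, 2.0, 8.0, 9.0, 5.0]       # the larger unlabeled set
    grown, left = self_train(seed, pool)
    print(grown)
    print(left)
```

In this toy run, two labeled points are enough to label four of the five unlabeled ones; the point that sits exactly between the two classes is left unlabeled rather than guessed. In a speech setting, the values would stand in for features extracted from audio clips, which is where the savings in manual labeling would come from.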

Nvidia says that for small-scale research and development, Riva Custom Voice will launch in open beta at no cost on the Nvidia NGC container registry. For customers with large-scale deployments, there's Riva Enterprise, a newly announced program that's expected to become available early next year and will offer technical support from Nvidia experts, the company says.

With Riva Custom Voice, Nvidia is effectively going toe to toe with Google, which in 2019 debuted new AI-synthesized WaveNet voices and standard voices in its Cloud Text-to-Speech service. Nvidia has another rival in Amazon, which recently launched a service -- Brand Voice -- that taps AI to generate custom spokespeople and offers a number of voice styles and emotion styles through Amazon Polly. For its part, Microsoft in February launched a synthetic voice generation service called Custom Neural Voice in limited access.

Potential misuse

AI-powered voices can deliver brand consistency, which research shows is one of the keys to increased customer loyalty. According to a survey conducted by Wunderman and Adobe, 63% of customers say that the best brands exceed expectations across the customer journey. A separate survey by Forrester found that 69% of U.S. consumers shop more with brands that offer consistent experiences in store and online.

But the technology can also be misused, as in the case of a CEO whose voice was imitated convincingly enough to initiate a wire transfer of $243,000. The constant Zoom meetings of the "anywhere workforce era" have created a wealth of audio and video data that can be fed into a machine learning system to create a compelling duplicate, VMware's Rick McElroy points out. According to the FBI, malicious actors are set to leverage synthetic content for cyber and foreign influence operations, perhaps as soon as within the next 12 months.

Some providers require that voice actors consent to use of the technology, review each potential use case, and have customers sign a code of conduct prior to deploying a synthetic voice. Microsoft has said that it's working on a way to embed a digital watermark within a synthetic voice to indicate that the content was created with Custom Neural Voice. Others like Resemble AI, a voice synthesis startup, have released open source tools designed to detect voice "deepfakes."

Nvidia didn't initially announce protections to prevent the abuse of Riva Custom Voice, but in its Riva terms of service, the company prohibits the creation of "fraudulent, false, misleading, or deceptive" content as well as content that "promote[s] discrimination, bigotry, racism, hatred, harassment, or harm against any individual or group." We'll update this piece once additional information is released.
