Nvidia takes on Meta and Google in the speech AI technology race

At Nvidia’s Speech AI Summit today, the company discussed its new speech artificial intelligence (AI) ecosystem, which it developed through a partnership with Mozilla Common Voice. The ecosystem focuses on developing crowdsourced multilingual speech corpuses and open-source pretrained models. Nvidia and Mozilla Common Voice aim to accelerate the growth of automatic speech recognition models that work universally for every language speaker worldwide.

Nvidia found that standard voice assistants, such as Amazon Alexa and Google Home, support fewer than 1% of the world’s spoken languages. To solve this problem, the company aims to improve linguistic inclusion in speech AI and expand the availability of speech data for global and low-resourced languages.

Nvidia is joining a race that both Meta and Google are already running: Recently, both companies released speech AI models to aid communication among people who speak different languages. Google’s speech-to-speech AI translation model, Translation Hub, can translate a large volume of documents into many different languages. Google also just announced it is building a universal speech translator, trained on over 400 languages, with the claim that it is the “largest language model coverage seen in a speech model today.”

At the same time, Meta AI’s universal speech translator (UST) project helps create AI systems that enable real-time speech-to-speech translation across all languages, even those that are spoken but not commonly written.

An ecosystem for global language users

According to Nvidia, linguistic inclusion in speech AI broadly improves data health, for example by helping AI models handle speaker diversity and a wide spectrum of noise profiles. The new speech AI ecosystem helps developers build, maintain and improve speech AI models and datasets with linguistic inclusion, usability and user experience in mind. Users can train their models on Mozilla Common Voice datasets, and then offer those pretrained models as high-quality automatic speech recognition architectures. Other organizations and individuals across the globe can then adapt and use those architectures to build their own speech AI applications.

“Demographic diversity is key to capturing language diversity,” said Caroline de Brito Gottlieb, product manager at Nvidia. “There are several vital factors impacting speech variation, such as underserved dialects, sociolects, pidgins and accents. Through this partnership, we aim to create a dataset ecosystem that helps communities build speech datasets and models for any language or context.”

The Mozilla Common Voice platform currently supports 100 languages, with 24,000 hours of speech data available from 500,000 contributors worldwide. The latest version of the Common Voice dataset also adds six new languages: Tigre, Taiwanese (Minnan), Meadow Mari, Bengali, Toki Pona and Cantonese. It also includes more speech data from female speakers.

Through the Mozilla Common Voice platform, users can donate their audio datasets by recording sentences as short voice clips, which Mozilla validates to ensure dataset quality upon submission.

Image source: Mozilla Common Voice.

“The speech AI ecosystem extensively focuses on not only the diversity of languages, but also on accents and noise profiles that different language speakers across the globe have,” Siddharth Sharma, head of product marketing, AI and deep learning at Nvidia, told VentureBeat. “This has been our unique focus at Nvidia and we created a solution that can be customized for every aspect of the speech AI model pipeline.”

Nvidia’s current speech AI implementations

The company is developing speech AI for several use cases, such as automatic speech recognition (ASR), automatic speech translation (AST) and text-to-speech. Nvidia Riva, part of the Nvidia AI platform, provides state-of-the-art GPU-optimized workflows for building and deploying fully customizable, real-time AI pipelines for applications like contact center agent assists, virtual assistants, digital avatars, brand voices, and video conferencing transcription. Applications developed through Riva can be deployed across all cloud types and data centers, at the edge, or on embedded devices.

NCS, a multinational company and a transportation technology partner of the Singapore government, customized Nvidia’s Riva FastPitch model and built its own text-to-speech engine for English-Singapore using local speakers’ voice data. NCS recently designed Breeze, a local driver’s app that translates languages including Mandarin, Hokkien, Malay and Tamil into Singaporean English with the same clarity and expressiveness as a native Singaporean would speak them.

Mobile communication conglomerate T-Mobile also partnered with Nvidia to develop AI-based software for its customer experience centers that transcribes customer conversations in real time and recommends solutions to thousands of frontline workers. To create the software, T-Mobile used Nvidia NeMo, an open-source framework for state-of-the-art conversational AI models, alongside Riva. These Nvidia tools enabled T-Mobile engineers to fine-tune ASR models on T-Mobile’s custom datasets and interpret customer jargon accurately across noisy environments.

Nvidia’s future focus on speech AI

Sharma says that Nvidia aims to incorporate current developments in AST and next-gen speech AI into real-time metaverse use cases.

“Today, we’re limited to only offering slow translation from one language to the other, and those translations have to go through text,” he said. “But the future is where you can have people in the metaverse across so many different languages all being able to have instant translation with each other.”

“The next step,” he added, “is developing systems that will enable fluid interactions with people across the globe through speech recognition for all languages and real-time text-to-speech.”
