At its GTC 2021 conference this morning, Nvidia announced the general availability of its Jarvis framework, which provides developers with pretrained AI models and software tools to create interactive conversational experiences. Nvidia says that Jarvis models, which first became available in preview in May 2020, offer automatic speech recognition, language understanding, real-time language translation, and text-to-speech capabilities for conversational agents.
The ubiquity of smartphones and messaging apps, spurred by the pandemic, has contributed to the increased adoption of conversational technologies. Fifty-six percent of companies told Accenture in a survey that conversational bots and other experiences are driving disruption in their industry. And a Twilio study showed 9 out of 10 consumers would like the option to use messaging to contact a business.
Leveraging GPU acceleration, Jarvis’ pipeline can run in under 100 milliseconds and be deployed in the cloud, in a datacenter, or at the edge. The framework includes models trained on over 1 billion pages of text and over 60,000 hours of speech that can be adjusted, optimized, fine-tuned with custom data, and tailored to different tasks, industries, and systems.
T-Mobile is among Jarvis’ early users, and Jarvis — which supports five languages including English, Chinese, and Japanese — has racked up more than 45,000 downloads since becoming available early last year. According to Nvidia, the telecom giant is using the framework to help resolve customer service issues in real time.
Even before the pandemic, autonomous agents were on the way to becoming the rule rather than the exception, partly because consumers prefer it that way. According to research published last year by Vonage subsidiary NewVoiceMedia, 25% of people prefer to have their queries handled by a chatbot or other self-service alternative. And Salesforce says roughly 69% of consumers choose chatbots for quick communication with brands.
Nvidia also announced that it’s partnering with Mozilla Common Voice, an open source collection of voice data for startups, researchers, and developers to train voice-enabled apps, services, and devices. The world’s largest multi-language public domain voice dataset, Common Voice contains over 9,000 total hours of contributed voice data in 60 different languages. Nvidia says it’s using Jarvis to develop pretrained models with the dataset that it will then offer to the community for free.
“We launched Common Voice to teach machines how real people speak in their unique languages, accents, and speech patterns,” Mozilla executive director Mark Surman said in a press release. “Nvidia and Mozilla have a common vision of democratizing voice technology — and ensuring that it reflects the rich diversity of people and voices that make up the internet.”
Newly revealed features in Jarvis will be released in the second quarter of 2021 as part of Nvidia’s ongoing open beta program. Developers can download the framework today from Nvidia’s NGC catalog.