Oto raises $5.3 million to improve speech recognition with intonation data

It's estimated that one in five people in the U.S. interact with a smart speaker on a daily basis and that the share of Google searches conducted by voice in the country recently surpassed 30%. Partly responsible for this adoption is the increasing sophistication of automatic speech recognition systems, the best of which recognize speech with accuracy matching or exceeding that of humans. But in spite of this, there's been comparatively little work in intonation classification, which by one measure could reveal five times as much information as words alone.

That's why a group of scientists at SRI International, the Menlo Park, California-based research institute perhaps best known for incubating Siri (prior to its acquisition by Apple), drew on expertise in behavioral science and AI to develop a novel approach to language understanding. The result of their work uses machine learning algorithms to bridge acoustic modalities (tone) and lexical modalities (words) to facilitate superior speech analysis.

This "acoustic language processing" technology was eventually spun off by Teo Borschberg and former Hyperloop Transportation Technologies AI team lead Nicolas Perony, who commercialized it under the banner of New York-based startup Oto. Oto aims to do no less than disrupt the $49 billion voice recognition market, and to this end it announced that it has raised $5.3 million in funding from Firstminute Capital, Fusion Fund, Interlace Ventures, SAP.iO, and SRI International.

The capital infusion will fuel product development, according to Perony. "With the release of ... our new deep learning framework for acoustic modeling at scale, we can now deploy language-agnostic models that further improve over time through supervised learning on ... labels, and unsupervised learning on raw human emotions in enterprise data sets," he said. "Through our ... platform, enterprises can leverage OTO to provide ... agents with real-time coaching and automating parts of quality assurance."

A new data set

Oto's conversational system leverages lexical information to understand opinions and acoustic information to decipher voiced emotions, according to Borschberg. But getting to this point was anything but easy.

Building on the SRI team's technical scaffolding, Oto compiled one of the largest sets of emotionally tagged speech data -- the Oto Emotion Dataset -- containing over 10,000 utterances from 3,000 speakers. (The goal is to reach 1 million hours by 2020.) By modeling tones together with phrases and words, an AI classifier trained on the corpus learned the relationships between what was spoken and how it was spoken, making use of an encoding scheme Borschberg describes as "acoustically aware" embeddings.
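Oto has not published how its "acoustically aware" embeddings are built, but the general idea of pairing what was said with how it was said can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the toy word vectors, the three summary features (mean pitch, pitch range, mean energy), and the function names are hypothetical and are not Oto's encoding scheme.

```python
from typing import Dict, List

# Hypothetical toy word vectors; a real system would learn these from data.
WORD_VECTORS: Dict[str, List[float]] = {
    "fine": [0.2, 0.7],
    "great": [0.9, 0.8],
}


def acoustic_features(pitch_hz: List[float], energy: List[float]) -> List[float]:
    """Summarize one word's audio frames as mean pitch, pitch range, mean energy."""
    mean_pitch = sum(pitch_hz) / len(pitch_hz)
    pitch_range = max(pitch_hz) - min(pitch_hz)
    mean_energy = sum(energy) / len(energy)
    return [mean_pitch, pitch_range, mean_energy]


def acoustically_aware_embedding(
    word: str, pitch_hz: List[float], energy: List[float]
) -> List[float]:
    """Concatenate the lexical vector (what was said) with acoustic features (how)."""
    return WORD_VECTORS[word] + acoustic_features(pitch_hz, energy)


# The same word spoken two ways yields two different vectors.
flat = acoustically_aware_embedding("fine", [110.0, 112.0, 111.0], [0.2, 0.2, 0.2])
lively = acoustically_aware_embedding("fine", [180.0, 240.0, 210.0], [0.6, 0.8, 0.7])
```

In this sketch the lexical half of the two vectors is identical, so any difference a downstream classifier picks up between them comes from the acoustic half.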

He claims that in an experiment designed to model the arousal (i.e., degree of intensity) of an emotion, which can inform the direction a conversation might take, Oto's systems were up to 40% more accurate than models that didn't factor in acoustics. Subsequent tests showed that the models distinguished between angry and sad data samples as well as they did between happy and sad samples -- up to 60% better in the case of the former compared with baseline text-only classifiers.

A powerful platform

The data set and model underpin Oto's web-hosted toolset, which is designed to be language-independent and plug-and-play for clients -- chiefly in the customer service segment. Its suite provides call center agents with in-call coaching to elevate their overall performance while at the same time optimizing the quality assurance process with targeted call sampling and objective metrics.

Oto's tools take advantage of SRI's SenSay Analytics, which performs real-time speaker state classification from spoken audio. It effectively transforms a spoken conversation into thousands of acoustic properties every second, building a live map of how an interaction is unfolding and allowing the system to drill down into the second-by-second acoustic structure.

Concretely, an interactive visual widget helps the agents employed by Oto's customers remain engaged during calls. (Agents see messages like "We're noticing low energy levels -- try to sound more engaged" when they slip into monotone.) Live dashboards surface real-time metrics and allow managers to play back calls, and root cause and topic modeling tools enable the targeting of key moments (like interest to purchase and satisfaction) and trigger automation.
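How Oto decides when an agent has slipped into monotone isn't disclosed. One simple way to picture such a trigger, offered purely as a sketch, is to watch how much an agent's pitch varies over a short window and show the prompt when the variation drops too low. The threshold value and the function name below are hypothetical; only the prompt text comes from the example above.

```python
from statistics import pstdev
from typing import List, Optional

# Hypothetical threshold: pitch spread (in Hz) below which speech is treated as monotone.
MONOTONE_PITCH_STDEV_HZ = 10.0

PROMPT = "We're noticing low energy levels -- try to sound more engaged"


def coaching_prompt(pitch_window_hz: List[float]) -> Optional[str]:
    """Return the coaching prompt if pitch barely varies in the window, else None."""
    if len(pitch_window_hz) < 2:
        return None  # Too little audio to judge.
    if pstdev(pitch_window_hz) < MONOTONE_PITCH_STDEV_HZ:
        return PROMPT
    return None


monotone_result = coaching_prompt([120.0, 121.0, 119.0, 120.0])
varied_result = coaching_prompt([120.0, 180.0, 140.0, 200.0])
```

A production system would presumably draw on many more acoustic properties than pitch alone; the sketch only shows the shape of a window-and-threshold rule.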

Borschberg says Oto has to date extracted over 3 billion intonation measurements from customer conversations, which have helped it model different sets of behaviors with over 90% accuracy based on intonation. Moreover, in one pilot deployment, Oto's coaching tools increased overall conversation engagement by 19%, leading to a 5% increase in sales conversion rate on "tens of thousands" of inbound calls. (Customers like ACD Direct say they've seen conversion rates increase by up to 18%.)

In a separate evaluation involving 4,000 hours of inbound sales conversations with an approximately 50% conversion rate, Oto trained its models to capture the acoustic signature of a successful sale and tested them against recordings its models had never heard. Oto reached 94% accuracy in predicting the outcome of a call from its acoustics alone, Borschberg claims.
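The roughly 50% conversion rate matters for reading that figure: when outcomes are evenly split, a model that always guesses the more common outcome is right only about half the time. The sketch below shows how held-out accuracy compares with that majority-class baseline. The labels and predictions are made-up toy values, not Oto's data, and the function names are hypothetical.

```python
from collections import Counter
from typing import Sequence


def accuracy(y_true: Sequence[int], y_pred: Sequence[int]) -> float:
    """Fraction of held-out calls whose outcome was predicted correctly."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("label and prediction lists must be non-empty and equal length")
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def majority_baseline(y_true: Sequence[int]) -> float:
    """Accuracy of always predicting the most common outcome."""
    return Counter(y_true).most_common(1)[0][1] / len(y_true)


# Toy held-out set: 1 = sale, 0 = no sale, evenly split like a 50% conversion rate.
held_out_labels = [1, 0, 1, 0, 1, 0, 1, 0]
predictions = [1, 0, 1, 0, 1, 0, 0, 0]

model_accuracy = accuracy(held_out_labels, predictions)
baseline_accuracy = majority_baseline(held_out_labels)
```

On a heavily skewed set of calls the same accuracy number would be far less informative, which is why the class balance is worth stating alongside it.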

"We are building the next generation of speech technology to humanize conversations by unlocking the trove of behavioral insights found in our daily communications," said Borschberg. "We're finally emerging from the research phase and are thrilled to be deploying our technology in the U.S. and EU to help enterprises better understand human behaviors like engagement, interest to purchase, and satisfaction."