These companies are shrinking the voice recognition 'accent gap'

Speech recognition has come a long way since IBM's Shoebox machine and Worlds of Wonder's Julie doll. By the end of 2018, the Google Assistant will support over 30 languages. Qualcomm has developed on-device models that can recognize words and phrases with 95 percent accuracy. And Microsoft's call center solution is able to transcribe conversations more accurately than a team of humans.

But despite the technological leaps and bounds made possible by machine learning, the voice recognition systems of today are at best imperfect -- and at worst discriminatory. In a recent study commissioned by the Washington Post, popular smart speakers made by Google and Amazon were 30 percent less likely to understand non-American accents than those of native-born users. And corpora like Switchboard, a dataset used by companies such as IBM and Microsoft to gauge the error rates of voice models, have been shown to skew measurably toward speakers from particular regions of the country.

"Data is messy because data is reflective of humanity," Rumman Chowdhury, global responsible AI lead at Accenture, told VentureBeat in an interview, "and that's what algorithms are best at: finding patterns in human behavior."

It's called algorithmic bias: the degree to which machine learning models reflect prejudices in data or design. Countless reports have demonstrated the susceptibility of facial recognition systems -- notably Amazon Web Services' Rekognition -- to bias. It has also been observed in automated systems that predict whether a defendant will commit future crimes, and even in the content recommendation algorithms behind apps like Google News.

Microsoft and other industry leaders such as IBM, Accenture, and Facebook have developed automated tools to detect and mitigate bias in AI algorithms, but few have been particularly vocal (pun intended) about solutions specific to voice recognition.

One that has is Speechmatics. Another is Nuance.

Addressing the 'accent gap'

Speechmatics, a Cambridge tech firm that specializes in enterprise speech recognition software, embarked 12 years ago on an ambitious plan to develop a language pack more accurate and comprehensive than any on the market.

It would have its roots in statistical language modeling and recurrent neural networks, a type of machine learning model that processes sequences step by step while retaining a memory of earlier inputs. In 2014, it took a first step toward its vision with a billion-word corpus for measuring progress in statistical language modeling, and in 2017, it reached another milestone: a partnership with the Qatar Computing Research Institute (QCRI) to develop Arabic speech-to-text services.

"We realized that we [needed] to come up with what we like to call 'one model to rule them all' -- an accent-agnostic language pack that is just as accurate at transcribing [an] Australian accent as it is with Scottish," Speechmatics CEO Benedikt von Thüngen said.

They succeeded in July of this year. The language pack -- dubbed Global English -- is the result of thousands of hours of speech data from over 40 countries and "tens of billions" of words. It supports "all major" English accents for speech-to-text transcription, and it's built on the back of Speechmatics' Automatic Linguist, an AI-powered framework that learns the linguistic foundations of new languages by drawing on patterns identified in known ones.

"Say you have an American on one side of the conversation and an Australian on the other, but the American lived in Canada and picked up a Canadian accent," Ian Firth, vice president of products at Speechmatics explained in an interview. "Most systems have a difficult time handling those types of situations, but ours doesn't.

In tests, Global English has outperformed accent-specific language packs in Google’s Cloud Speech API and the English language pack in IBM's Cloud. Von Thüngen claims that on the high end, it's between 23 percent and 55 percent more accurate.

Speechmatics isn't the only company claiming to have narrowed the accent gap.

Burlington, Massachusetts-based Nuance says it employs several methods to ensure its voice recognition models understand speakers of the roughly 80 languages its products support equally well.

For its UK voice model, it drew on 20 defined dialect regions and included words particular to each dialect (e.g., the word "cob" for a bread roll), along with their pronunciations. The resulting language pack recognizes 52 different variations of the word "Heathrow."

But it went one step further. Newer versions of Dragon, Nuance's bespoke speech-to-text software suite, employ a machine learning model that switches automatically between several different dialect models depending on the user's accent. Compared to older versions of the software without the model-switching neural network, it performs 22.5 percent better for English speakers with a Hispanic accent, 16.5 percent better for southern U.S. dialects, and 17.4 percent better for Southeast Asian speakers of English.
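Nuance has not published how Dragon's switching works, so the following is only a rough sketch of the general idea: an accent classifier scores each utterance, and the audio is routed to whichever dialect model scores highest. Every name here (`make_switching_transcriber`, `toy_classifier`, the dialect labels, the 0.5 confidence threshold) is a hypothetical stand-in, not part of any Nuance product.

```python
from typing import Callable, Dict

# Hypothetical interfaces: a transcriber turns audio into text, and an
# accent classifier returns a score per dialect label.
Transcriber = Callable[[bytes], str]
AccentClassifier = Callable[[bytes], Dict[str, float]]


def make_switching_transcriber(
    classify_accent: AccentClassifier,
    dialect_models: Dict[str, Transcriber],
    fallback: str,
    min_confidence: float = 0.5,  # assumed threshold, chosen for illustration
) -> Transcriber:
    """Return a transcriber that picks a dialect model per utterance."""

    def transcribe(audio: bytes) -> str:
        scores = classify_accent(audio)
        # Pick the dialect label with the highest score.
        label, confidence = max(scores.items(), key=lambda item: item[1])
        # Use the general model when the classifier is unsure or names
        # a dialect for which no model is registered.
        if confidence < min_confidence or label not in dialect_models:
            label = fallback
        return dialect_models[label](audio)

    return transcribe


# Toy stand-ins so the sketch runs end to end; real models would
# analyze the audio signal rather than inspect a byte prefix.
def toy_classifier(audio: bytes) -> Dict[str, float]:
    if audio.startswith(b"southern"):
        return {"southern_us": 0.9, "general": 0.1}
    return {"southern_us": 0.3, "general": 0.4}


models: Dict[str, Transcriber] = {
    "general": lambda audio: "general model output",
    "southern_us": lambda audio: "southern model output",
}

transcribe = make_switching_transcriber(toy_classifier, models, fallback="general")
```

The fallback matters in a design like this: a misrouted utterance could be transcribed worse than it would be by a general model, so low-confidence cases are sent to the general one.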

The more data, the better

Ultimately, the accent gap in voice recognition is a data problem. The higher the quantity and diversity of speech samples in a corpus, the more accurate the resulting model -- at least in theory.

In the Washington Post's test, Google Home speakers were 3 percent less likely to give accurate responses to people with Southern accents than those with Western accents, and Amazon's Echo devices performed 2 percent worse with Midwest inflections.
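One common way to put a number on gaps like these is word error rate (WER) computed separately for each accent group. The Post's test measured response accuracy rather than WER, so the sketch below illustrates the general approach, not that study's method. The function names and the sample sentences are invented for the example.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def word_errors(reference: str, hypothesis: str) -> Tuple[int, int]:
    """Return (word-level edit distance, number of reference words)."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    # Levenshtein distance over words, keeping only the previous row.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1], len(ref)


def wer_by_accent(samples: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Aggregate WER per accent from (accent, reference, hypothesis) rows."""
    errors: Dict[str, int] = defaultdict(int)
    words: Dict[str, int] = defaultdict(int)
    for accent, reference, hypothesis in samples:
        e, n = word_errors(reference, hypothesis)
        errors[accent] += e
        words[accent] += n
    return {a: errors[a] / words[a] for a in errors if words[a] > 0}


# Made-up transcripts, purely to show the shape of the input.
toy_samples = [
    ("western", "turn on the lights", "turn on the lights"),
    ("southern", "turn on the lights", "turn on the light"),
]
rates = wer_by_accent(toy_samples)
```

Comparing the per-accent rates, rather than a single pooled figure, is what exposes a gap: a model can post a low overall error rate while still performing noticeably worse for one group.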

An Amazon spokesperson told the Washington Post that Alexa’s voice recognition is continually improving over time, as more users speak to it with various accents. And Google in a statement pledged to “continue to improve speech recognition for the Google Assistant as we expand our datasets.”

Voice recognition systems will on some level improve as more people begin to use them regularly -- nearly 100 million smart speakers will be sold globally by 2019, according to market research firm Canalys, and roughly 55 percent of U.S. households will own one by 2022.

Just don't expect a silver bullet.

"With today's technology, you're not going to have the most accurate speech for every use case in the entire world," Firth said. "The best you can do is make sure the accuracy is good for people who are trying to use it."
