Some automatic speech recognition (ASR) systems might be less accurate than previously assumed. That’s the top-level finding of a recent study by researchers at Johns Hopkins University, the Poznań University of Technology in Poland, the Wrocław University of Science and Technology, and startup Avaya. The study benchmarked commercial speech recognition models on an internally created dataset. The coauthors claim that the word error rates (WER) they measured, WER being a common speech recognition performance metric, were significantly higher than the best reported results, and that this could indicate a wider-ranging problem in the field of natural language processing (NLP).

ASR has become ubiquitous; it is used to transcribe meetings, dictate emails, manage smart appliances, and more. A comprehensive benchmark of ASR models cites WER as low as 2% to 3% on standard corpora, but the coauthors of this latest report reject that statistic as a measure of real-world accuracy. The majority of interactions with ASR systems happen in the context of “chatbot-like interactions,” they claim, where people are aware they’re conversing with a machine and thus simplify their commands to short, well-structured phrases that lack the disfluencies characteristic of natural conversation.

The coauthors evaluated several ASR systems on a dataset of 50 call center conversations from 1,595 agents and 1,261 customers, which spanned 8.5 hours in length, 2.2 hours of which was speech. Depending on the dataset, the ASR systems’ previously published error rates didn’t exceed 15% and dropped as low as 2%. The study’s findings were markedly worse: tested across recorded phone conversations about finance, insurance, telecom, and booking, the systems showed WER as high as 23.31%. The highest rates were on the booking and telecom calls, perhaps because the conversations referred to specific dates and times, money, places, and product and company names. But WER was above 13.73% in every domain.
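For readers unfamiliar with the metric, WER is conventionally the number of word substitutions, deletions, and insertions needed to turn a system's transcript into the reference transcript, divided by the number of words in the reference. The Python sketch below illustrates that standard definition only. The function name `word_error_rate`, the lowercase-and-split normalization, and the example sentences are assumptions made for illustration, not the scoring setup used in the study.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: text is lowercased and split on whitespace,
    which is a simplifying assumption, not the study's normalization.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution_cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,                      # deletion (reference word missed)
                curr[j - 1] + 1,                  # insertion (extra hypothesis word)
                prev[j - 1] + substitution_cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical booking-style utterance: the transcript drops "a"
# and adds "the", so 2 errors against 10 reference words.
reference = "i would like to book a flight on may third"
hypothesis = "i would like to book flight on may the third"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # prints 20.0%
```

Because insertions count as errors, WER can exceed 100% when a transcript contains many extra words, and a single misrecognized date or product name in a short utterance moves the figure considerably.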

[Figure: Automatic speech recognition WER]

The researchers attribute the disparity to frequently used benchmarks like LibriSpeech (1,000 hours of English audiobook recordings), WSJ (dictations and conversations from journalists), and Switchboard (phone exchanges), which they say might be too simple to truly challenge ASR systems. Even more holistic benchmarks suffer from the “domain adaptation problem”: while they attempt to mimic real, spontaneous conversations, they’re inherently artificial because they involve pairs of voice actors having a conversation on subjects drawn from agreed-upon topics. Other benchmark datasets come from scripted or semi-scripted speech like TED Talks. Moreover, the datasets tend to be homogeneous with respect to voice actor demographics. Non-native speakers are virtually absent from benchmark datasets, and factors like pronunciation, linguistics, and gender often aren’t accounted for.

“Benchmark datasets do not represent the true diversity of real-world conversations, both at input signal characteristics and conversation semantics levels,” the coauthors wrote. “The domain of application imposes strict constraints on the vocabulary and the form of the conversations … There are consequential differences between scripted and spontaneous conversations and they affect the results of the ASR evaluation.”

As a remedy, the researchers suggest the ASR and NLP communities collect and annotate audio datasets better aligned with contemporary applications of ASR systems. They also call for work on extended and more inclusive acoustic models representing a broader spectrum of dialects, as well as models that account for technological advances that influence physical properties of processed audio signals.

“These problems are not insurmountable. A thoughtful collaboration between academia and industry partners can lead to the creation of high-quality training and testing datasets,” the researchers continued. “We believe that the overly optimistic perception of ASR accuracy is detrimental to the development of conversational natural language processing downstream applications.”