Debate over accuracy of AI transcription services rages on

Transcription service companies continue to make claims about the accuracy of their AI, but in the absence of any formal set of benchmarks for natural language processing (NLP), most of those claims lack the context required to make a precise apples-to-apples comparison.

Dialpad, for example, recently announced that its Voice Intelligence AI technology has surpassed rivals in terms of both keyword and general accuracy. Along with Google, IBM, Microsoft, Cisco, Chorus.ai, Otter.ai, Avaya, Zoom, and Amazon Web Services (AWS), Dialpad provides end users with automated speech recognition (ASR) and NLP capabilities that can be applied to voice calls and videoconferencing sessions in real time.

Dialpad published an analysis of general word and keyword accuracy in which it claims a word accuracy rating of 82.3% and a keyword accuracy rating of 92.5%. That compares with 79.8% word accuracy and 90.9% keyword accuracy for the Google Enhanced service. However, neither service is being evaluated by an independent tester using the same words and keywords, run at the same time and spoken in precisely the same way.

Instead, Dialpad created a collection of test sets that contain audio and the accompanying transcript, which is considered the "ground truth" of what was said in the audio. The company sent the audio to each service evaluated and received a transcript back, which it then compared to the ground truth. Dialpad then calculated the number of errors to determine an accuracy percentage.
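To make that kind of calculation concrete, here is a minimal sketch in Python of how a transcript might be scored against a ground-truth transcript. It is not Dialpad's published method, which the company has not detailed here. It assumes word accuracy is one minus the word error rate (word-level edit distance divided by the number of reference words) and that keyword accuracy is the share of spoken keywords that appear in the returned transcript. The function names and the sample sentences are hypothetical.

```python
def word_errors(reference, hypothesis):
    """Return (edit_distance, reference_length) counted in words.

    The edit distance is the minimum number of word substitutions,
    insertions, and deletions needed to turn the reference transcript
    into the hypothesis transcript (word-level Levenshtein distance).
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    prev = list(range(len(hyp) + 1))  # distances for an empty reference
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # deleting the first i reference words
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1], len(ref)


def word_accuracy(reference, hypothesis):
    """Assumed definition: 1 - (word errors / reference words), floored at 0."""
    errors, n_ref = word_errors(reference, hypothesis)
    if n_ref == 0:
        raise ValueError("reference transcript is empty")
    return max(0.0, 1.0 - errors / n_ref)


def keyword_accuracy(reference, hypothesis, keywords):
    """Assumed definition: share of spoken keywords found in the hypothesis."""
    wanted = {k.lower() for k in keywords}
    spoken = [w for w in reference.lower().split() if w in wanted]
    if not spoken:
        raise ValueError("no keywords occur in the reference transcript")
    returned = set(hypothesis.lower().split())
    return sum(1 for w in spoken if w in returned) / len(spoken)


# Hypothetical ground truth and service output, for illustration only.
truth = "please refund the invoice for my account today"
output = "please refund the invoice from my account"

print(word_accuracy(truth, output))   # 0.75 (one substitution, one deletion, eight words)
print(keyword_accuracy(truth, output, {"refund", "invoice", "account"}))  # 1.0
```

The example also shows why the two figures can diverge: a service can miss filler words and still capture every keyword, so a keyword score above the general word score is unsurprising.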

In spite of these efforts, it can still be challenging to draw a definitive conclusion about how much more accurate one ASR is than another for specific use cases. There has been work to establish benchmarks, such as the General Language Understanding Evaluation (GLUE) effort, which evaluates natural language understanding models on a set of sentence-level tasks rather than on speech recognition itself. There are also corpora such as Fisher and Switchboard that give academics standard datasets for evaluating ASR systems. Thus far, however, no benchmark consensus has emerged. Even if such a consensus is reached, the jargon employed across industries tends to vary. AI transcription services applied in, for example, the health care sector will need to be trained to understand that sector's specific nomenclature.

Less clear is to what degree such claims are likely to affect any decision to standardize on a transcription service. It's still early as far as transcription services are concerned, so most end user expectations are not all that high, said Zeus Kerravala, founder and principal analyst for ZK Research. "At this stage, a lot of end users expect there to be errors," Kerravala said.

Dialpad believes it will gradually become more apparent that embedding ASR capabilities within a communications platform is superior to approaches that rely on application programming interfaces (APIs) to access a speech-to-text service from a cloud service provider. The company's 2018 acquisition of TalkIQ has enabled it to embed these capabilities as a set of microservices that run natively on its core communications platforms, Dialpad CEO Craig Walker said.

After continuously updating its platform over the last several years, the company has now analyzed more than 1 billion minutes of voice calls, Walker said. Each of those calls has enabled the Voice AI technology created by TalkIQ to both transcribe conversations more accurately and surface sentiment analytics in real time. The Dialpad contact center platform, for example, can recognize when sentiment turns negative during a call and alert a manager, Walker said. "It becomes part of the workflow," Walker said.

Organizations can also create their own custom dictionary of terms, and the Voice Intelligence AI platform will learn to address use cases that might be unique to an industry or lexicon only employed in a specific region, Walker said.

It's not clear to what degree organizations are evaluating accuracy as a criterion for selecting one conversational AI platform over another. Avaya and other rivals are employing a mix of AI engines developed internally with services their platforms call externally via an API. However, Walker said it will continue to become apparent that conversational AI engines that run natively within a platform are not only more efficient but also less costly to implement because the amount of systems integration effort required is sharply reduced. There is no need to first set up and then maintain APIs to call an external cloud service, Walker said.

Regardless of the platform employed, the level of data mining applied to voice calls in real time is about to substantially increase. Previously, data mining could only be applied after a call was recorded and transcribed into text. A sentiment analytics report would then be generated long after the initial call or video conference ended.

The fact that voice calls are now being analyzed in real time is likely to have a profound impact on how individuals interact with one another. In many cases, one of the reasons individuals still prefer to make voice calls instead of sending an email is that they don't want the substance of those communications to be recorded.

Regardless of the intent, however, the days when voice calls were exempt from the level of analytics already being applied to other communications media are clearly coming to an end.
