Two years ago, researchers at IBM claimed state-of-the-art transcription performance with a machine learning system trained on two public speech recognition data sets, a feat more impressive than it might seem. The system had to contend not only with distortions in the training corpora’s audio snippets, but also with a range of speaking styles, overlapping speech, interruptions, restarts, and exchanges among participants.
In pursuit of an even more capable system, researchers at the Armonk, New York-based company recently devised an architecture detailed in a paper (“English Broadcast News Speech Recognition by Humans and Machines”) that will be presented at the International Conference on Acoustics, Speech, and Signal Processing in Brighton this week. They say that in preliminary experiments it achieved industry-leading results on broadcast news captioning tasks.
Getting to this point wasn’t easy. The task posed its own set of challenges, like audio signals with lots of background noise and presenters speaking on a wide variety of news topics. And while most of the speech in the training corpora was well-articulated, the data also included less pristine material such as onsite interviews, clips from TV shows, and other multimedia content.
As IBM researcher Samuel Thomas explains in a blog post, the AI pairs acoustic models built on long short-term memory (LSTM) networks, a type of architecture capable of learning long-term dependencies, with complementary neural network language models. The acoustic models contained up to 25 layers of nodes (mathematical functions mimicking biological neurons) trained on speech spectrograms, or visual representations of signal spectrums, while six-layer LSTM networks learned a “rich” set of acoustic features to enhance the language modeling.
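To make the spectrogram idea concrete: a spectrogram is built by sliding a short window along the audio signal and taking the frequency content of each windowed frame, producing a time-by-frequency image. Below is a minimal NumPy sketch; the frame length, hop size, and sample rate are illustrative choices, not the settings used in IBM's paper.

```python
import numpy as np

def spectrogram(signal, frame_len=400, hop=160):
    """Magnitude spectrogram: slide a window along the signal,
    take the FFT of each frame, and keep the magnitudes."""
    window = np.hanning(frame_len)
    n_frames = 1 + (len(signal) - frame_len) // hop
    frames = np.stack([
        signal[i * hop : i * hop + frame_len] * window
        for i in range(n_frames)
    ])
    # rfft keeps only the non-negative frequencies of a real signal
    return np.abs(np.fft.rfft(frames, axis=1))

# One second of a 440 Hz tone sampled at 16 kHz
sr = 16000
t = np.arange(sr) / sr
tone = np.sin(2 * np.pi * 440 * t)
spec = spectrogram(tone)
print(spec.shape)  # rows are time frames, columns are frequency bins
```

Each row of the resulting array is one time slice, and a pure tone shows up as a bright column at its frequency bin; it is these 2D arrays that convolutional or recurrent acoustic models consume as input.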
After feeding the system 1,300 hours of broadcast news data published by the Linguistic Data Consortium, an international nonprofit supporting language-related education, research, and technology development, the researchers set the AI loose on a test set containing two hours of data from six shows with close to 100 overlapping speakers altogether. (A second test set contained four hours of broadcast news data from 12 shows with about 230 overlapping speakers.) The team worked with speech and search tech firm Appen to measure error rates on speech recognition tasks and reported that the system achieved 6.5% on the first test set and 5.9% on the second, still short of human performance at 3.6% and 2.8%, respectively.
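The percentages above are word error rates: the number of word substitutions, insertions, and deletions needed to turn the system's transcript into the reference transcript, divided by the length of the reference. A minimal sketch of that computation (a standard Levenshtein distance over words, not IBM's scoring pipeline):

```python
def wer(reference, hypothesis):
    """Word error rate: word-level edit distance
    (substitutions + insertions + deletions) / reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # d[i][j] = edit distance between ref[:i] and hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,          # deletion
                          d[i][j - 1] + 1,          # insertion
                          d[i - 1][j - 1] + cost)   # substitution
    return d[-1][-1] / len(ref)

print(wer("the news at six tonight", "the news at six"))  # 0.2, one word dropped out of five
```

By this measure, a 6.5% rate means roughly one word in fifteen is wrong, while the human transcribers' 3.6% is closer to one in twenty-eight.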
“[Our] new results … are the lowest we are aware of for this task, [but] there is still room for new techniques and improvements in this space,” wrote Thomas.