A new study from researchers affiliated with University College London, Nokia Bell Labs Cambridge, and the University of Oxford shows how differences in microphone quality can affect speech recognition accuracy. The coauthors use a custom data set called Libri-Adapt, which contains 7,200 hours of English speech, to test how well Mozilla’s DeepSpeech model handles unfamiliar environments and microphones. The findings suggest a noticeable degradation in accuracy occurs during certain “domain shifts,” with word error rate climbing to as high as 28% after switching microphones.
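The headline metric, word error rate, counts the word-level substitutions, insertions, and deletions needed to turn a model’s transcript into the reference, divided by the number of reference words. A minimal Python sketch of the calculation (plain edit distance, not DeepSpeech’s own scorer) looks like this:

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    # Word-level edit distance: substitutions + deletions + insertions,
    # normalized by the number of words in the reference.
    ref, hyp = reference.split(), hypothesis.split()
    # dp[i][j] = edit distance between the first i reference words
    # and the first j hypothesis words
    dp = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        dp[i][0] = i
    for j in range(len(hyp) + 1):
        dp[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dp[i][j] = min(dp[i - 1][j] + 1,         # deletion
                           dp[i][j - 1] + 1,         # insertion
                           dp[i - 1][j - 1] + cost)  # substitution
    return dp[len(ref)][len(hyp)] / max(len(ref), 1)

# One substituted word out of five -> WER of 0.2 (20%)
print(word_error_rate("turn on the kitchen lights", "turn on the kitten lights"))
```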

Automatic speech recognition models must perform well across hardware to be reliable. Customers expect the models powering Alexa to work similarly on different smart speakers, smart displays, and other smart devices, for instance. But some models fall short of this ideal because they aren’t consistently trained on corpora containing speech recorded with microphones of varying quality and in novel settings.

Libri-Adapt is designed to expose these flaws with speech recorded using the microphones in six different products: a PlayStation Eye camera, a generic USB mic, a Google Nexus 6 smartphone, the Shure MV5, a Raspberry Pi accessory called ReSpeaker, and the Matrix Voice developer kit. The corpus has speech data in three English accents (U.S. English, British English, and Indian English), culled from 251 U.S. speakers and synthetic voices generated by Google Cloud Platform’s text-to-speech API. Beyond this, Libri-Adapt contains wind, rain, and laughter background noises as added confounders.
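To make that domain structure concrete, the snippet below sketches how such a corpus might be sliced along the three axes the researchers vary: microphone, accent, and background noise. The directory layout and helper function are assumptions for illustration, not the actual Libri-Adapt release format or loader:

```python
from pathlib import Path

# Hypothetical on-disk layout for a Libri-Adapt-style corpus, keyed by the
# three domain axes varied in the study. The real release may differ.
MICS = ["pseye", "usb", "nexus6", "shure-mv5", "respeaker", "matrix-voice"]
ACCENTS = ["en-us", "en-gb", "en-in"]
NOISES = ["clean", "wind", "rain", "laughter"]

def list_clips(root: Path, mic: str, accent: str, noise: str = "clean"):
    """Yield (wav_path, transcript) pairs for one domain slice."""
    for wav in sorted((root / mic / accent / noise).glob("*.wav")):
        txt = wav.with_suffix(".txt")
        if txt.exists():
            yield wav, txt.read_text().strip()

# Example slice: U.S.-accented speech recorded by the PlayStation Eye, no added noise.
# clips = list(list_clips(Path("libri-adapt"), "pseye", "en-us"))
```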

Libri-Adapt word error rate

Above: Word error rate of a fine-tuned DeepSpeech model trained and tested on various microphone pairs for U.S. English speech. The columns correspond to the training microphone domain and rows correspond to the test microphone domain.

During experiments, the researchers compared the speech recognition performance of a pretrained DeepSpeech model (version 0.5.0) across the aforementioned six devices. They found that when data from the same microphone was used to train and test the model, DeepSpeech unsurprisingly achieved the smallest error rate (e.g., 11.39% in the case of PlayStation Eye). But when there was a mismatch between the training and testing microphones, the word error rate jumped substantially (e.g., to 24.18% when a model trained on PlayStation Eye-recorded speech was tested on Matrix Voice speech).
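A cross-microphone evaluation like the one in the table above can be sketched as a loop over training/test microphone pairs. Here, load_model and transcribe are placeholders standing in for DeepSpeech fine-tuning and inference rather than the project’s actual API, and word_error_rate is the scorer sketched earlier:

```python
def wer_matrix(mics, load_model, transcribe, test_sets):
    # Build the train-microphone x test-microphone WER matrix: the diagonal
    # holds matched-microphone results (e.g., 11.39% for PlayStation Eye),
    # while off-diagonal cells capture microphone domain shifts.
    results = {}
    for train_mic in mics:              # column: training microphone domain
        model = load_model(train_mic)   # placeholder: model fine-tuned on this mic
        for test_mic in mics:           # row: test microphone domain
            errors = [word_error_rate(ref, transcribe(model, wav))
                      for wav, ref in test_sets[test_mic]]
            results[(train_mic, test_mic)] = sum(errors) / len(errors)
    return results
```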

The researchers say Libri-Adapt, which has been released as open source, can be used to create scenarios that test the generalizability of speech recognition algorithms. As an example, they tested a DeepSpeech model trained on U.S.-accented speech collected by a ReSpeaker microphone against Indian-accented speech with rain background noise recorded by a PlayStation Eye. The results show the model’s word error rate climbed by nearly 29.8%, pointing to poor robustness on the model’s part.

Although the coauthors claim to have manually verified hundreds of Libri-Adapt’s recordings, they caution that some might be incomplete or noisy. In future work, they plan to develop unsupervised domain adaptation algorithms to tackle domain shifts in the data set.