Mozilla updates DeepSpeech with an English language model that runs 'faster than real time'

DeepSpeech, a suite of speech-to-text and text-to-speech engines maintained by Mozilla's Machine Learning Group, this morning received an update (to version 0.6) that incorporates one of the fastest open source speech recognition models to date. In a blog post, senior research engineer Reuben Morais lays out what's new and enhanced, as well as other spotlight features in the pipeline.

The latest version of DeepSpeech adds support for TensorFlow Lite, a version of Google's TensorFlow machine learning framework that's optimized for compute-constrained mobile and embedded devices. The change has reduced DeepSpeech's package size from 98MB to 3.7MB and the size of its built-in English model from 188MB to 47MB. That model has a 7.5% word error rate on a popular benchmark and was trained on 5,516 hours of transcribed audio from the WAMU (NPR), LibriSpeech, Fisher, Switchboard, and Mozilla's Common Voice English data sets. Plus, the update has cut DeepSpeech's memory consumption by a factor of 22 and made its startup more than 500 times faster.

This more efficient English language model -- which runs "faster than real time" on a single core of a Raspberry Pi 4 and which is 50% smaller than before (including the inference code and the trained model) -- is available on Windows, macOS, and Linux as well as Android.
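For readers unfamiliar with the phrase, "faster than real time" is usually expressed as a real-time factor: the time spent transcribing divided by the length of the audio, with anything below 1.0 counting as faster than real time. A minimal sketch of that arithmetic follows; the function names and the sample numbers are illustrative assumptions, not figures from Mozilla.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Compute time divided by audio duration; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


def is_faster_than_real_time(processing_seconds: float, audio_seconds: float) -> bool:
    return real_time_factor(processing_seconds, audio_seconds) < 1.0


# Hypothetical numbers for illustration only, not measurements from the article.
rtf = real_time_factor(processing_seconds=7.0, audio_seconds=10.0)
print(f"RTF = {rtf:.2f}, faster than real time: {rtf < 1.0}")
```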

DeepSpeech 0.6 is much more performant overall, thanks in part to a new streaming decoder that enables "consistent" low latency and memory utilization regardless of the length of audio being transcribed. Additionally, the platform's two main subsystems -- an acoustic model that receives audio features as inputs and outputs character probabilities, plus a decoder that transforms character probabilities into textual transcripts -- are both now capable of streaming. This means that there's no longer any need for carefully tuned silence detection algorithms, said Morais.
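To make the two-stage design concrete, here is a toy sketch of a streaming decoder that consumes per-frame character probabilities in chunks and can report a transcript at any point. It assumes a simple greedy CTC-style rule (collapse repeated characters, drop a "blank" symbol), which is a stand-in for illustration and not DeepSpeech's actual decoder; the alphabet, class name, and method names are all hypothetical.

```python
from typing import List, Sequence

ALPHABET = ["a", "b", "c", " "]  # hypothetical toy alphabet
BLANK = len(ALPHABET)            # index of the extra "blank" symbol


class StreamingGreedyDecoder:
    """Toy greedy decoder that keeps only a small state between chunks."""

    def __init__(self) -> None:
        self._prev = BLANK            # best symbol of the previous frame
        self._chars: List[str] = []   # transcript emitted so far

    def feed(self, frames: Sequence[Sequence[float]]) -> None:
        """Consume a chunk of frames; each frame is a probability per symbol."""
        for probs in frames:
            best = max(range(len(probs)), key=probs.__getitem__)
            # Emit only when the symbol changes and is not the blank.
            if best != self._prev and best != BLANK:
                self._chars.append(ALPHABET[best])
            self._prev = best

    def intermediate(self) -> str:
        """Transcript so far, without ending the stream."""
        return "".join(self._chars)

    def finish(self) -> str:
        """Return the final transcript and reset the decoder."""
        text = self.intermediate()
        self._prev, self._chars = BLANK, []
        return text


def one_hot(index: int) -> List[float]:
    """Helper that fakes an acoustic model output for one frame."""
    return [1.0 if i == index else 0.0 for i in range(len(ALPHABET) + 1)]


decoder = StreamingGreedyDecoder()
decoder.feed([one_hot(i) for i in (0, 0, BLANK, 0)])      # first audio chunk
print(decoder.intermediate())
decoder.feed([one_hot(i) for i in (1, BLANK, BLANK, 2)])  # second audio chunk
print(decoder.finish())
```

Because the decoder carries its state from one chunk to the next, the work done per chunk does not depend on how much audio came before, which is the property the streaming design is after.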

The new DeepSpeech provides transcriptions 260 milliseconds after the end of the audio, or 73% faster than before the streaming decoder was implemented. As for intermediate transcript requests at seconds 2 and 3 of audio files, they're returned in a fraction of the time.

That's not all that's improved on the performance side of the equation. Now, thanks to an upgrade to TensorFlow 1.14 and the adoption of newly available APIs, DeepSpeech is up to two times faster when it comes to model training. Moreover, it's capable of fully training and deploying models at different sample rates (e.g., 8kHz for telephony data), and the new decoder exposes timing and confidence metadata for each character in the transcript.
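Per-character timing is the kind of metadata that lets an application build word-level timestamps, for captions or for seeking within a recording. The sketch below shows one way that grouping could work; the `CharTiming` record and its field names are hypothetical stand-ins and are not DeepSpeech's actual metadata structure.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class CharTiming:
    """Hypothetical per-character record: the character and when it starts."""
    char: str
    start_time: float  # seconds from the start of the audio


def words_with_start_times(items: List[CharTiming]) -> List[Tuple[str, float]]:
    """Group characters into words, keeping each word's first-character time."""
    words: List[Tuple[str, float]] = []
    current: List[str] = []
    start = 0.0
    for item in items:
        if item.char == " ":
            if current:  # a space closes the word in progress
                words.append(("".join(current), start))
                current = []
        else:
            if not current:  # first character of a new word
                start = item.start_time
            current.append(item.char)
    if current:  # flush the last word
        words.append(("".join(current), start))
    return words


# Made-up timings for illustration only.
sample = [
    CharTiming("h", 0.10), CharTiming("i", 0.16), CharTiming(" ", 0.30),
    CharTiming("y", 0.42), CharTiming("o", 0.50),
]
print(words_with_start_times(sample))
```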

Lastly, DeepSpeech now offers packages for Windows, with .NET, Python, JavaScript, and C bindings. The .NET package is available in the NuGet Gallery and can be installed directly from Visual Studio. An example in DeepSpeech's repository demonstrates transcription from an audio file and from a microphone or other audio input device.

Mozilla's work in natural language processing extends to the aforementioned Common Voice data set, which was recently updated with 1,400 hours of speech across 18 languages. It's one of the largest multi-language data sets of its kind, Mozilla claims -- substantially larger than the Common Voice corpus it made publicly available eight months ago, which contained 500 hours (400,000 recordings) from 20,000 volunteers in English -- and it'll soon grow larger still. The organization says that data collection efforts in 70 languages are actively underway via the Common Voice website and mobile apps.