Mozilla winds down DeepSpeech development, announces grant program

In 2017, Mozilla launched DeepSpeech, an initiative incubated within the machine learning team at Mozilla Research focused on open sourcing an automatic speech recognition model. Over the next four years, the DeepSpeech team released newer versions of the model capable of transcribing lectures, phone conversations, television programs, radio shows, and other live streams with "human accuracy." But in the coming months, Mozilla plans to cease development and maintenance of DeepSpeech as the company transitions into an advisory role, which will include the launch of a grant program to fund a number of initiatives demonstrating applications for DeepSpeech.

DeepSpeech isn't the only open source project of its kind, but it's among the most mature. Modeled after research papers published by Baidu, the model is an end-to-end trainable, character-level architecture that can transcribe audio in a range of languages. One of Mozilla's major aims was to achieve a transcription word error rate of lower than 10%, and the newest versions of the pretrained English-language model achieve that aim, averaging around a 7.5% word error rate.
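For readers unfamiliar with the metric, word error rate is conventionally the word-level edit distance between a reference transcript and the model's output, divided by the number of words in the reference. The sketch below is a minimal illustration of that standard definition, not Mozilla's evaluation code; the function name and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: real evaluations usually normalize case and
    punctuation first, which this sketch does not do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from output
            insertion = curr[j - 1] + 1  # extra word in output
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substituted word and one dropped word
# against a six-word reference.
rate = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"{rate:.1%}")
```

By this definition, a 7.5% word error rate corresponds to roughly seven or eight word-level mistakes per 100 reference words.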

Mozilla believes DeepSpeech has reached the point where the next step is building applications. To this end, the company plans to transition the project to "people and organizations" interested in furthering "use-case-based explorations." Mozilla says it has streamlined the continuous integration processes for getting DeepSpeech up and running with minimal dependencies. And as the company cleans up the documentation and prepares to end staff upkeep of the codebase, it says it will publish a toolkit to help researchers, companies, and other interested parties use DeepSpeech to build voice-based solutions.

DeepSpeech: A brief history

Mozilla's work on DeepSpeech began in late 2017, with the goal of developing a model that takes audio features (speech) as input and outputs characters directly. The team hoped to design a system that could be trained using Google's TensorFlow framework via supervised learning, where the model learns to infer patterns from datasets of labeled speech.
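The article does not say how per-frame predictions become text. The Baidu Deep Speech papers that the model follows use connectionist temporal classification (CTC), in which the network emits one label per audio frame, including a special "blank," and repeated labels are merged. The sketch below shows only that greedy collapsing step, assuming CTC-style output; the blank symbol and the frame sequence are made up for illustration and are not taken from the DeepSpeech codebase.

```python
BLANK = "_"  # hypothetical stand-in for the CTC blank label


def greedy_collapse(frame_labels):
    """Merge repeated labels, then drop blanks (greedy CTC-style decoding)."""
    decoded = []
    previous = None
    for label in frame_labels:
        # Keep a label only when it differs from the previous frame
        # and is not the blank symbol.
        if label != previous and label != BLANK:
            decoded.append(label)
        previous = label
    return "".join(decoded)


# Made-up per-frame labels; the blank between the two "l" runs is what
# lets a doubled letter survive the merge.
print(greedy_collapse(list("hh_e_ll_ll_oo")))  # hello
```

Production decoders typically replace this greedy step with a beam search guided by a language model, but the collapsing rule is the same.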

The latest DeepSpeech model contains tens of millions of parameters, the parts of the model that are learned from historical training data. The Mozilla Research team started training it with a single computer running four Titan X Pascal GPUs but eventually migrated it to two servers with eight Titan XPs each. In the project's early days, training a high-performing model took about a week.

In the years that followed, Mozilla worked to shrink the DeepSpeech model while boosting its performance and staying below the 10% error rate target. The English-language model shrank from 188MB to 47MB, and memory consumption fell 22-fold. In December 2019, the team managed to get DeepSpeech running "faster than real time" on a single core of a Raspberry Pi 4.
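"Faster than real time" is usually expressed as a real-time factor: processing time divided by audio duration, with values below 1.0 meaning transcription finishes before the audio would have finished playing. The sketch below works through that ratio along with the size reduction quoted above. The timing figures are hypothetical and are not measurements from the Raspberry Pi 4 run.

```python
def real_time_factor(processing_seconds: float, audio_seconds: float) -> float:
    """Processing time per second of audio; below 1.0 is faster than real time."""
    if audio_seconds <= 0:
        raise ValueError("audio_seconds must be positive")
    return processing_seconds / audio_seconds


# Hypothetical timing: 7 seconds of compute for a 10-second clip.
rtf = real_time_factor(processing_seconds=7.0, audio_seconds=10.0)
print(f"real-time factor: {rtf:.2f}, faster than real time: {rtf < 1.0}")

# Size reduction from the figures quoted in the article (188MB to 47MB).
print(f"model is {188 / 47:.0f}x smaller")
```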

Mozilla initially trained DeepSpeech using freely available datasets like TED-LIUM and LibriSpeech as well as paid corpora like Fisher and Switchboard, but these proved to be insufficient. So the team reached out to public TV and radio stations, language study departments in universities, and others they thought might have labeled speech data to share. Through this effort, they were able to more than double the amount of training data for the English-language DeepSpeech model.

Inspired by these data collection efforts, the Mozilla Research team collaborated with Mozilla's Open Innovation team to launch the Common Voice project, which seeks to collect and validate speech contributions from volunteers. Common Voice consists not only of voice snippets but of voluntarily contributed metadata useful for training speech engines, like speakers' ages, sex, and accents. It's also grown to include dataset target segments for specific purposes and use cases, like the digits "zero" through "nine" and the words "yes," "no," "hey," and "Firefox."

Today, Common Voice is one of the largest multi-language public domain voice corpora in the world, with more than 9,000 hours of voice data in 60 languages, ranging from widely spoken ones to less-used ones such as Welsh and Kinyarwanda. Over 164,000 people have contributed to the dataset to date.

To support the project's growth, Nvidia today announced that it would invest $1.5 million in Common Voice to engage more communities and volunteers and support the hiring of new staff. Common Voice will now operate under the umbrella of the Mozilla Foundation as part of its initiatives focused on making AI more trustworthy.

Grant program

As it winds down the development of DeepSpeech, Mozilla says its forthcoming grant program will prioritize projects that contribute to the core technology while also showcasing its potential to "empower and enrich" areas that may not otherwise have a viable route toward speech-based interaction. More details will be announced in May, when Mozilla publishes a playbook to guide people on how to use DeepSpeech's codebase as a starting point for voice-powered applications.

"We're seeing mature open source speech engines emerge. However, there is still an important gap in the ecosystem: speech engines -- open and closed -- don't work for vast swaths of the world's languages, accents, and speech patterns," Mark Surman, executive director of the Mozilla Foundation, told VentureBeat via email. "For billions of internet users, voice-enabled technologies simply aren't usable. Mozilla has decided to focus its efforts this side of the equation, making voice technology inclusive and accessible. That means investing in voice data sets rather than our own speech engine. We're doubling down on Common Voice, an open source dataset that focuses on languages and accents not currently represented in the voice tech ecosystem. Common Voice data can be used to feed [open speech] frameworks ... and in turn to allow more people in more places to access voice technology. We're [also] working closely with Nvidia to match up these two sides of the inclusive voice tech equation."