MLCommons releases open source datasets for speech recognition

MLCommons, the nonprofit consortium dedicated to creating open AI development tools and resources, today announced the release of the People's Speech Dataset and the Multilingual Spoken Words Corpus. The consortium claims that the People's Speech Dataset is among the world's most comprehensive English speech datasets licensed for academic and commercial usage, with tens of thousands of hours of recordings, and that the Multilingual Spoken Words Corpus (MSWC) is one of the largest audio speech datasets with keywords in 50 languages.

No-cost datasets such as TED-LIUM and LibriSpeech have long been available for developers to train, test, and benchmark speech recognition systems. But some, like Fisher and Switchboard, require licensing or relatively high one-time payments. This puts even well-resourced organizations at a disadvantage compared with tech giants such as Google, Apple, and Amazon, which can gather large amounts of training data through devices like smartphones and smart speakers. For example, four years ago, when researchers at Mozilla began developing the English-language speech recognition system DeepSpeech, the team had to reach out to TV and radio stations and language departments at universities to supplement the public speech data that they were able to find.

With the release of the People's Speech Dataset and the MSWC, the hope is that more developers will be able to build their own speech recognition systems with fewer budgetary and logistical constraints than before, according to Keith Achorn. Achorn, a machine learning engineer at Intel, is one of the researchers who have overseen the curation of the People's Speech Dataset and the MSWC over the past several years.

"Modern machine learning models rely on vast quantities of data to train. Both 'The People’s Speech' and 'MSWC' are among the largest datasets in their respective classes. MSWC is of particular interest for its inclusion of 50 languages," Achorn told VentureBeat via email. "In our research, most of these 50 languages had no keyword-spotting speech datasets publicly available until now, and even those which did had very limited vocabularies."

Open-sourcing speech tooling

Starting in 2018, a working group formed under the auspices of MLCommons to identify and chart the 50 most-used languages in the world into a single dataset -- and figure out a way to make the dataset useful. Members of the team came from Harvard and the University of Michigan as well as Alibaba, Oracle, Google, Baidu, Intel, and others.

The researchers who put the dataset together were an international group hailing from the U.S., South America, and China. They met weekly for several years via conference call, each bringing a particular expertise to the project.

The project eventually spawned two datasets instead of one -- the People's Speech Dataset and the MSWC -- which are individually detailed in whitepapers being presented this week at the annual Conference on Neural Information Processing Systems (NeurIPS). The People's Speech Dataset targets speech recognition tasks, while MSWC involves keyword spotting, which deals with the identification of keywords (e.g., "OK, Google," "Hey, Siri") in recordings.

People's Speech Dataset versus MSWC

The People's Speech Dataset contains over 30,000 hours of supervised conversational audio released under a Creative Commons license, which can be used to create the kind of speech recognition models powering voice assistants and transcription software. MSWC, on the other hand -- which has more than 340,000 keywords with upwards of 23.4 million examples, spanning languages spoken by over 5 billion people -- is designed for applications like call centers and smart devices.

Previous speech datasets relied on manual efforts to collect and verify thousands of examples for individual keywords, and were commonly restricted to a single language. Moreover, these datasets didn't leverage "diverse speech," meaning that they poorly represented a natural environment -- lacking accuracy-boosting variables like background noise, informal speech patterns, and a mixture of recording equipment.

Both the People's Speech Dataset and the MSWC also have permissive licensing terms that allow commercial use, which stands in contrast to many speech training libraries. Datasets typically either fail to formalize their licenses, relying on end-users to take responsibility, or are restrictive in the sense that they prohibit use in products bound for the open market.

"The working group envisioned several use cases during the development process. However, we are also aware that these spoken word datasets may find further use by models and systems we did not yet envision," Achorn continued. "As both datasets continue to grow and develop under the direction of MLCommons, we are seeking additional sources of high-quality and diverse speech data. Finding sources which comply with our open licensing terms makes this more challenging, especially for non-English languages. On a more technical level, our pipeline uses forced alignment to match speech audio with transcript text. Although methods were devised to compensate for mixed transcript quality, improving accuracy comes at a cost to the quantity of data."
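
The tradeoff Achorn describes, where stricter quality checks leave less usable audio, can be illustrated with a minimal sketch. This is not the MLCommons pipeline: the `AlignedClip` record, the `alignment_score` field, the thresholds, and the sample values are all hypothetical and chosen only to show how raising a quality bar shrinks the hours of data that remain.

```python
from dataclasses import dataclass


@dataclass
class AlignedClip:
    """Hypothetical record for one audio clip matched to transcript text."""
    clip_id: str
    duration_s: float       # clip length in seconds
    alignment_score: float  # assumed 0..1 confidence that audio and text match


def filter_clips(clips, min_score):
    """Keep only clips whose alignment score meets the threshold."""
    return [c for c in clips if c.alignment_score >= min_score]


def total_hours(clips):
    """Total duration of the given clips, in hours."""
    return sum(c.duration_s for c in clips) / 3600.0


# Made-up sample data, for illustration only.
clips = [
    AlignedClip("a", 1800.0, 0.95),
    AlignedClip("b", 1800.0, 0.80),
    AlignedClip("c", 3600.0, 0.60),
    AlignedClip("d", 7200.0, 0.40),
]

# A higher threshold keeps cleaner clips but fewer hours overall.
for threshold in (0.0, 0.5, 0.9):
    kept = filter_clips(clips, threshold)
    print(f"min_score={threshold}: {len(kept)} clips, {total_hours(kept)} hours")
```

In this toy data, moving the threshold from 0.0 to 0.9 cuts the retained audio from 4 hours to half an hour, which is the kind of quality-versus-quantity decision a dataset curator has to weigh.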

Open source trend

The People's Speech Dataset complements the Mozilla Foundation's Common Voice, another of the largest speech datasets in the world, with more than 9,000 hours of voice data in 60 different languages. In a sign of growing interest in the field, Nvidia recently announced that it would invest $1.5 million in Common Voice to engage more communities and volunteers and support the hiring of new staff.

Recently, voice technology has surged in adoption among enterprises in particular, with 68% of companies reporting they have a voice technology strategy in place, according to Speechmatics -- an 18% increase from 2019. And among the companies that don't, 60% plan to in the next five years.

Building datasets for speech recognition remains a labor-intensive pursuit, but one promising approach coming into wider use is unsupervised learning, which could cut down on the need for bespoke training libraries. Traditional speech recognition systems require examples of speech labeled to indicate what's being said, but unsupervised systems can learn without labels by picking up on subtle relationships within the training data.

Researchers at Guinea-based tech accelerator GNCode and Stanford have experimented with using radio archives in creating unsupervised systems for "low-resource" languages, particularly Maninka, Pular, and Susu in the Niger-Congo family. A team at MLCommons called 1000 Words in 1000 Languages is creating a pipeline that can take any recorded speech and automatically generate clips to train compact speech recognition models. Separately, Facebook has developed a system, dubbed wav2vec-U, that can learn to recognize speech from unlabeled data.