Facebook details wav2vec, an AI algorithm that uses raw audio to improve speech recognition

Automatic speech recognition, or ASR, is a foundational part of not only assistants like Apple's Siri, but also dictation software such as Nuance's Dragon and customer support platforms like Google's Contact Center AI. It's the thing that enables machines to parse utterances for key phrases and words and that allows them to distinguish people by their intonations and pitches.

Perhaps it goes without saying that ASR is an intense area of study for Facebook, whose conversational tech is used to power Portal's speech recognition and which is broadening the use of AI to classify content on its platform. To this end, at the Interspeech conference earlier this year the Menlo Park company detailed wav2vec, a novel machine learning algorithm that improves ASR accuracy by using raw, untranscribed audio as training data. Facebook claims it achieves state-of-the-art results on a popular benchmark while using two orders of magnitude less training data and that it demonstrates a 22% error reduction over the leading character-based speech recognition system, Deep Speech 2.

Wav2vec was made available earlier this year as an extension to the open source modeling toolkit fairseq, and Facebook says it plans to use wav2vec to provide better audio data representations for keyword spotting and acoustic event detection. Additionally, it hopes to improve its existing systems that proactively identify posts in violation of its community guidelines.

"Wav2vec represents a step forward for ASR systems, and it's a promising direction for recognizing speech in languages that do not have extensive data sets for training AI systems," wrote Facebook research scientists and software engineers Michael Auli, Siddhartha Shah, Alexei Baevski, and Christian Fuegen in a blog post. "But it's also part of our long-term vision for self-supervised training, an approach that takes advantage of unlabeled training examples and enables us to move beyond the comparatively limited number of data sets that have been gathered and annotated specifically for training AI systems."

Alongside wav2vec, Facebook showcased a new self-supervision model -- ConvLM -- that achieves state-of-the-art performance in correctly recognizing words outside of its training lexicon, and a lightweight sequence-to-sequence (seq2seq) model for speech recognition that's reportedly more efficient than previous work while delivering a better word error rate (WER). Both were also presented at Interspeech in Graz, Austria in September.

Building wav2vec

As Auli and colleagues explained in the submitted paper, ASR systems are typically trained on audio sequences in the form of spectrograms (representations of the spectrum of frequencies over time) and corresponding text. Predictably, procuring these examples requires labeling vast amounts of audio data by hand, which takes valuable time and resources. By contrast, wav2vec is self-supervised, meaning it uses unlabeled data in conjunction with small amounts of labeled data.

Wav2vec first trains a model to distinguish between true data and a set of distractor samples, which helps it to learn the mathematical representations of audio data on which it trains. An encoder model maps raw audio input to sets of vectors (arrays of numbers with values corresponding to features) where each vector covers about 30 milliseconds of speech, while a context model uses the vectors to generate its own representations covering up to a second of audio.
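To see how a stack of convolutions could yield vectors that each cover about 30 milliseconds of audio, here is a minimal receptive-field calculation. The kernel sizes, strides, and 16 kHz sample rate are assumptions for illustration; the article does not state them, and the actual encoder configuration should be checked against the paper or the fairseq code.

```python
# Assumed encoder shape: five 1-D convolutions with these kernel sizes and
# strides, applied to 16 kHz audio. These numbers are NOT given in the article.
ASSUMED_KERNELS = (10, 8, 4, 4, 4)
ASSUMED_STRIDES = (5, 4, 2, 2, 2)
ASSUMED_SAMPLE_RATE_HZ = 16_000


def receptive_field(kernels, strides):
    """Return (samples seen by one output vector, samples between outputs)."""
    span, hop = 1, 1
    for kernel, stride in zip(kernels, strides):
        # Each layer widens the window by (kernel - 1) steps of the current hop.
        span += (kernel - 1) * hop
        # Each stride multiplies the spacing between consecutive outputs.
        hop *= stride
    return span, hop


span, hop = receptive_field(ASSUMED_KERNELS, ASSUMED_STRIDES)
span_ms = 1000 * span / ASSUMED_SAMPLE_RATE_HZ
hop_ms = 1000 * hop / ASSUMED_SAMPLE_RATE_HZ
print(f"each vector spans {span} samples ({span_ms:.1f} ms)")
print(f"a new vector every {hop} samples ({hop_ms:.1f} ms)")
```

Under these assumed settings each vector sees roughly 29 milliseconds of audio and a new vector is produced every 10 milliseconds, which is in line with the figures quoted above.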

With those representations in hand, wav2vec next attempts to solve a series of self-supervision prediction tasks by generating shorter, 10-millisecond-long distractor examples from the 10-second audio clips on which it trained. For these distractor samples, the original audio is swapped out with sections from elsewhere in the clip, and the model must determine which of the 10-millisecond versions is correct.

In this way, wav2vec learns to discern accurate speech sounds from distractor samples hundreds of times per second, effectively becoming its own transcriber. The prediction task also serves as the basis for wav2vec's self-supervision: Automatically generating incorrect versions of speech examples to test the system on and evaluating its performance removes the need to manually annotate training data.
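The true-versus-distractor task described above can be sketched as a small contrastive loss. This is a toy illustration and not Facebook's implementation: the vectors are hand-written, the function names are hypothetical, and the sigmoid-based scoring is an assumed form of the objective that the article does not spell out.

```python
import math
import random


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def log_sigmoid(x):
    # Numerically stable log(sigmoid(x)).
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def sample_distractors(frames, true_index, k, rng):
    """Pick k frames from elsewhere in the same clip to act as distractors."""
    candidates = [i for i in range(len(frames)) if i != true_index]
    return [frames[i] for i in rng.sample(candidates, k)]


def contrastive_loss(prediction, true_vec, distractors):
    """Low when the prediction scores the true frame high and distractors low."""
    loss = -log_sigmoid(dot(prediction, true_vec))
    for d in distractors:
        loss -= log_sigmoid(-dot(prediction, d))
    return loss


# Toy "clip": four frame vectors standing in for encoder outputs.
frames = [[1, 0, 0], [0, 1, 0], [0, 0, 1], [-1, 0, 0]]
rng = random.Random(0)
distractors = sample_distractors(frames, true_index=0, k=3, rng=rng)

good_prediction = [2, 0, 0]  # points toward the true frame
bad_prediction = [0, 2, 0]   # points toward one of the distractors
loss_good = contrastive_loss(good_prediction, frames[0], distractors)
loss_bad = contrastive_loss(bad_prediction, frames[0], distractors)
print(f"loss when predicting the true frame: {loss_good:.3f}")
print(f"loss when predicting a distractor:  {loss_bad:.3f}")
```

No transcripts appear anywhere in the sketch: the "label" is simply which frame actually came from that position in the clip, which is what makes the task self-supervised.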

Training and testing wav2vec

The Facebook AI team trained wav2vec on just under 1,000 hours of unlabeled speech examples from the LibriSpeech data set, a corpus that draws from public domain audiobooks. Next, they trained a speech recognition model on roughly 81 hours of labeled speech from the WSJ1 corpus -- a collection of Wall Street Journal articles read aloud -- with representations generated by wav2vec.

The results were impressive. Facebook's wav2vec-based model achieved a 2.43% word error rate (WER), compared with the 3.1% WER demonstrated by Deep Speech 2, a baseline system trained using 12,000 hours (roughly 150 times more) of transcribed data, representing a 22% relative decrease in error rate. In subsequent experiments, the wav2vec-trained model led to better performance than pretraining on the labeled version of LibriSpeech and a 30% improvement in WER versus a model lacking pretrained representations.
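For readers who want to check the arithmetic, the sketch below shows one standard way to compute WER (word-level edit distance divided by reference length) and reproduces the relative reduction and data ratio from the reported figures. The function names and the sample sentences are illustrative, not taken from Facebook's evaluation code.

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance divided by the number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between the reference words processed
    # so far (minus the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_reduction(baseline, improved):
    """Fraction by which `improved` lowers `baseline`."""
    return (baseline - improved) / baseline


# Figures reported in the article.
reduction = relative_reduction(3.1, 2.43)
data_ratio = 12_000 / 81
print(f"relative WER reduction: {reduction:.1%}")
print(f"baseline used about {data_ratio:.0f}x more labeled audio")

# Illustrative sentences: one substitution and one deletion in six words.
example = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"example WER: {example:.3f}")
```

The reduction works out to about 21.6%, which rounds to the 22% quoted, and the labeled-data ratio is about 148, consistent with the "150 times more" figure.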

According to Auli and the team, these results suggest self-supervised techniques could expand ASR capabilities to low-resource languages with limited data sets of transcribed examples. "The broader implications for this work are related to the pursuit of self-supervised training techniques by teams at Facebook AI, as well as in the wider AI community," they wrote. "Self-supervision is accelerating development not only in speech but also across virtually every domain in the field. The quickest way to transition toward a future in which unlabeled training data is the rule, rather than the exception, will be through ongoing open, collaborative science."

ConvLM and improved seq2seq

In addition to wav2vec, Facebook researchers recently detailed ConvLM, which uses self-supervised language modeling at the character level to handle unfamiliar words even in languages that lack spaces between words (like Japanese and Thai). A standalone ConvLM library with a Python wrapper is now publicly available, along with trained models on the LibriSpeech data set.

Unlike most word-transcribing algorithms, which define a vocabulary by computing the frequency of all words and don't recognize those (like names or locations) that fail to meet a specific threshold, ConvLM adopts a lexicon-free approach. Specifically, it predicts whole words one letter at a time, tapping Facebook's wav2letter++ framework to model the acoustics of data samples and the company's fairseq-py toolkit for language model training.
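The vocabulary cutoff problem is easy to reproduce in miniature. The sketch below is not ConvLM; it only contrasts a frequency-thresholded word vocabulary with a character inventory on a made-up corpus, and all names and the threshold value are hypothetical.

```python
from collections import Counter

# Made-up training text for illustration only.
corpus = "the model hears the speech and the model writes the words from graz"


def build_word_vocab(text, min_count):
    """Keep only words seen at least `min_count` times."""
    counts = Counter(text.split())
    return {word for word, n in counts.items() if n >= min_count}


def out_of_vocab(sentence, vocab):
    """Words a word-level system with this vocabulary cannot emit."""
    return [word for word in sentence.split() if word not in vocab]


def can_spell(word, alphabet):
    """A character-level system only needs every letter to be known."""
    return all(ch in alphabet for ch in word)


word_vocab = build_word_vocab(corpus, min_count=2)
alphabet = set(corpus.replace(" ", ""))

missing = out_of_vocab("the model hears graz", word_vocab)
print("word vocabulary:", sorted(word_vocab))
print("out of vocabulary:", missing)
print("spellable letter by letter:", [w for w in missing if can_spell(w, alphabet)])
```

Rare words such as place names fall below the cutoff and vanish from the word vocabulary, yet they remain reachable when output is produced one character at a time.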

In tests, Facebook AI researchers say that ConvLM correctly recognizes up to 33% of out-of-vocabulary occurrences for clear speech without background noise and that it delivers a better WER and character error rate than any previous character-based and lexicon-free ASR model. Moreover, they say that ConvLM improves the efficiency of wav2vec by performing word-piece modeling, an intermediary representation of text between words and characters.

Faster seq2seq model

Complementing ConvLM and wav2vec is Facebook's new seq2seq model for speech recognition, which the company claims requires 75% fewer parameters than previous models without sacrificing accuracy.

The key was what Facebook AI researchers call a time-depth separable block, a highly efficient internal connectivity structure, plus a parallelizable decoder model. The architecture is engineered to scale linearly with input sequence length, making it much more efficient with long inputs commonly found in speech recognition. Moreover, when coupled with a convolutional language model, it enables ASR deployment on smaller devices while scaling to larger self-supervised and semi-supervised learning algorithms.

This latest research builds on Facebook's extensive work in natural language processing and ASR, most recently a system that when given voice data is able to produce new speech samples in multiple languages. In a May report, Facebook said its AI and machine learning systems now identify 65% of the more than 4 million hate speech posts removed from Facebook each quarter. And at its F8 developer conference last year, Facebook announced it will bring natural language processing (NLP) integration to Facebook Pages, which automatically draws language from a Page's inbox to create AI that answers the questions customers or followers are most likely to ask.

In other news, Facebook recently debuted Pythia, a modular plug-and-play framework that enables data scientists to quickly build, reproduce, and benchmark AI models. Facebook AI and University of Washington researchers devised ways to enhance Google's BERT language model and achieve performance on par with or exceeding state-of-the-art results across popular benchmark data sets. And Facebook earlier this summer founded the AI Language Research Consortium to solve challenges in natural language processing.

"We've seen promising results using self-supervision in our recent advances in natural language processing, particularly with machine translation. With approximately 6,500 languages spoken around the world -- and with over 50% of the Facebook community speaking a language other than English -- exploring self-supervised methods that can speed ASR development is an important research pursuit for Facebook, as well as for the broader AI research community," wrote Auli and colleagues. "This emphasis on self-supervised techniques, which require far less labeled training data and are less reliant on language-specific fine-tuning, will help ensure that state-of-the-art ASR benefits everyone, including speakers of low-resource languages -- beyond English and toward a more global perspective."