Facebook's speech recognition model supports 51 different languages

Facebook researchers have developed what they claim is the largest automatic speech recognition (ASR) model of its kind -- a model that learned to understand words in 51 languages after training on over 16,000 hours of voice recordings. In a paper published on the preprint server arXiv.org, the coauthors say the system, which contains around a billion parameters, improves speech recognition performance by up to 28.8% on one benchmark compared with baselines.

Designing a single model to recognize speech in multiple languages is desirable for several reasons. It simplifies the backend production pipeline, for one thing, and studies have shown training multilingual models on similar languages can decrease overall word error rate (WER).
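For readers unfamiliar with the metric, WER is conventionally computed as the word-level edit distance (substitutions, insertions, and deletions) between a reference transcript and the model's output, divided by the number of words in the reference. The following is a minimal sketch of that standard definition; the function name and the whitespace tokenization are assumptions for illustration, not details taken from Facebook's paper.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length.

    Illustrative only: splits on whitespace and applies no text
    normalization, which real evaluations usually do.
    """
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1      # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)
```

Because insertions count as errors, WER can exceed 1.0 when the hypothesis is much longer than the reference.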

Facebook's model -- a so-called joint sequence-to-sequence (Seq2Seq) model -- was trained with the encoder, decoder, and token set shared across all languages. The encoder maps input audio sequences to intermediate representations while the decoder maps those representations to output text. The shared token set, which is built by sampling sentences from the languages at different frequencies, simplifies the process of working with many languages.

The researchers divided the 51 languages into distinct groups with a different decoder for each, and then they selected 10,000 "subword" units as the token set for each individual language group. Next, they manually merged some of the smaller language groups until they ended up with six in total, which prevented the group sizes from becoming overly skewed by the number of languages they contained.

The coauthors created a training data set from anonymized videos publicly shared by Facebook, which they divided into three categories: high-resource languages with over 600 hours of training data (e.g., English, Hindi, French), mid-resource languages with 300 to 500 hours of data (e.g., Bengali, Japanese, Russian), and low-resource languages with 100 to 150 hours of data (e.g., Norwegian, Swahili, Lithuanian). After transcribing the videos according to certain guidelines, they tuned the model's hyperparameters, or the parameters whose values are used to control the learning process.
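The three resource categories can be expressed as a simple lookup on hours of training data. The sketch below uses only the ranges reported above; the function name is hypothetical, and because the reported ranges leave gaps (for example, between 150 and 300 hours), values outside them are labeled "unspecified" rather than guessed at.

```python
def resource_tier(hours: float) -> str:
    """Map hours of training data to the resource tiers described above.

    Thresholds come from the ranges reported for the data set; anything
    outside those ranges is returned as "unspecified".
    """
    if hours > 600:
        return "high"
    if 300 <= hours <= 500:
        return "mid"
    if 100 <= hours <= 150:
        return "low"
    return "unspecified"
```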

The researchers report that across several experiments, the best-performing version of their model improved WER by 9.1% on average for high-resource languages, by 12.44% for mid-resource languages, and by 28.76% for low-resource languages. It also performed well on low-resource languages it hadn't seen before, including Traditional Chinese, Persian, and Telugu.
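Improvements like these are typically stated as a relative reduction in WER against a baseline, and the sketch below assumes that reading. The function name is hypothetical, and the numbers in the usage line are made up to show the arithmetic; they are not results from the paper.

```python
def relative_wer_reduction(baseline_wer: float, model_wer: float) -> float:
    """Percentage by which model_wer is lower than baseline_wer."""
    if baseline_wer <= 0:
        raise ValueError("baseline_wer must be positive")
    return (baseline_wer - model_wer) / baseline_wer * 100.0


# Hypothetical values: a baseline WER of 20.0 and a model WER of 15.0
# correspond to a 25% relative reduction.
reduction = relative_wer_reduction(20.0, 15.0)
```

Under this convention, a model that lowers WER from 20.0 to 15.0 improves by 25% in relative terms, even though the absolute difference is only five points.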

"To the best of our knowledge, this work is the first one to study multilingual systems at a massive scale," the Facebook researchers wrote. "We demonstrated that it is possible to train a massive single ASR architecture for 51 various languages, which we found in practice considerably less time-consuming to tune than 51 different monolingual baselines."

The unveiling of the new model comes after Facebook detailed wav2vec 2.0, an improved framework for self-supervised speech recognition. In a paper, researchers claimed wav2vec 2.0 outperformed the best semi-supervised methods while being conceptually simpler, achieving state-of-the-art results using just 10 minutes of labeled data and pretraining on 53,000 hours of unlabeled data.
