Facebook's open source M2M-100 model can translate between 100 different languages

Facebook today open-sourced M2M-100, an algorithm it claims is the first capable of translating between any pair of 100 languages without relying on English data. The machine learning model, which was trained on 2,200 language pairs, ostensibly outperforms English-centric systems on a metric commonly used to evaluate machine translation performance.

The goal of multilingual machine translation is to build a model that can translate between any pair of the world's more than 7,000 languages. Multilingual translation models share information between similar languages, which benefits low-resource language pairs and allows for zero-shot translation, or translation between language pairs the model hasn't seen during training. As models increase in size, they require larger datasets that can be laborious and difficult to create. (For instance, supporting 100 languages would require 100 billion sentence pairs.) That has led some researchers to focus on English datasets and modeling techniques. But this bias in the data and modeling is not reflective of how people use translation and leads to worse performance for non-English translations.
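To see where a figure on the order of 100 billion could come from, the arithmetic can be sketched in a few lines of Python. The 10 million sentence pairs per direction used here is an illustrative assumption, not a number stated in this article, and the function names are hypothetical.

```python
def translation_directions(num_languages: int) -> int:
    """Count ordered (source, target) pairs among num_languages languages."""
    return num_languages * (num_languages - 1)


# Assumption for illustration only: about 10 million sentence pairs per direction.
ASSUMED_PAIRS_PER_DIRECTION = 10_000_000


def sentence_pairs_needed(
    num_languages: int,
    pairs_per_direction: int = ASSUMED_PAIRS_PER_DIRECTION,
) -> int:
    """Total sentence pairs needed to cover every direction directly."""
    return translation_directions(num_languages) * pairs_per_direction


if __name__ == "__main__":
    for n in (10, 100):
        print(n, translation_directions(n), sentence_pairs_needed(n))
```

Under that assumption, 100 languages yield 9,900 directions and roughly 99 billion sentence pairs, which is why mining every direction directly scales so poorly.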

By contrast, Facebook's M2M-100 was trained on a dataset of over 7.5 billion sentences across 100 different languages. To build it, Facebook researchers decided upon three criteria to guide their language selection. They sought to include widely spoken languages from different families and with geographic diversity. They then narrowed the list down to those for which evaluation data exists so it would be easier to quantify the model's performance. Finally, of the remaining languages, they eliminated those for which monolingual data wasn't available.

M2M-100 builds on XLM-R, Facebook's multilingual model that can learn from data in one language and execute a task in 100 languages. In July, Facebook released a speech recognition model that supports 51 different languages. And more recently, the company detailed CRISS, which taps unlabeled data from many different languages to mine sentences across languages and train superior models.

"For years, AI researchers have been working toward building a single, universal model that can understand all languages across different tasks," Angela Fan, a data scientist at Facebook AI Research Paris, wrote in a blog post. "A single model that supports all languages, dialects, and modalities will help us better serve more people, keep translations up to date and create new experiences for billions of people equally."

For M2M-100, Facebook researchers employed novel language identification techniques to mine ostensibly higher-quality data from a range of sources. One was Language-Agnostic Sentence Representations (LASER), an open source toolkit that performs zero-shot transfers of natural language processing models. Two others were CCMatrix, a "billion-scale" bitext dataset for training translation models, and CCAligned, a large collection of cross-lingual web document pairs.

Facebook researchers avoided pairs for which translation demand was statistically rare (like Icelandic-Nepali or Sinhala-Javanese) and introduced a "bridge mining strategy" in which languages were grouped into 14 families based on classification, geography, and cultural similarities. The intuition was that people living in countries with languages in the same group would communicate more often and benefit from higher-quality translations. For instance, one family might include a range of languages spoken in India, such as Bengali, Hindi, Marathi, Nepali, Tamil, and Urdu.

To connect the languages of different families, Facebook researchers identified a small number of "bridge languages," or one to three major languages in each family. (Hindi, Bengali, and Tamil became bridge languages for Indo-Aryan languages in the dataset, for example.) Then, they mined training data for all possible combinations of these bridge languages, which netted them the aforementioned 7.5 billion sentences of data.
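The pair selection described above can be sketched roughly as follows. This is a minimal illustration, assuming that every pair inside a group is mined and that bridge languages are paired with each other across groups. The groupings, the language codes, and the function name are hypothetical and heavily abbreviated, not Facebook's actual lists.

```python
from itertools import combinations

# Hypothetical, abbreviated groups for illustration only.
FAMILIES = {
    "indic": ["bn", "hi", "mr", "ne", "ta", "ur"],
    "romance": ["es", "fr", "it", "pt"],
}

# Hypothetical bridge languages: one to three per group.
BRIDGES = {
    "indic": ["bn", "hi", "ta"],
    "romance": ["es", "fr"],
}


def pairs_to_mine(families, bridges):
    """Return the unordered language pairs selected for mining."""
    pairs = set()
    # Step 1: every pair of languages inside the same group.
    for langs in families.values():
        pairs.update(frozenset(p) for p in combinations(langs, 2))
    # Step 2: every pair of bridge languages, including pairs across groups.
    all_bridges = sorted(b for group in bridges.values() for b in group)
    pairs.update(frozenset(p) for p in combinations(all_bridges, 2))
    return pairs


if __name__ == "__main__":
    selected = pairs_to_mine(FAMILIES, BRIDGES)
    print(len(selected), "pairs selected for mining")
```

In this toy setup, Hindi-French is mined because both are bridge languages, while Marathi-Italian is skipped because neither is a bridge and the two sit in different groups.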

Facebook supplemented data for low-resource languages using back translation, a method that trains a model in the reverse translation direction and uses it to translate monolingual data, producing synthetic, back-translated sentence pairs. For instance, if the goal was to train a Chinese-to-French translation model, the Facebook researchers would train a model for French to Chinese and translate all of the monolingual French data to create synthetic Chinese. In the course of M2M-100's development, Facebook added synthetic data to mined languages and created data for previously unseen language pairs.
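The Chinese-to-French example can be sketched in Python as follows. The reverse model here is a toy word lookup standing in for a trained French-to-Chinese system, and all names are hypothetical. It only illustrates how the synthetic pairs are assembled, not how Facebook's pipeline is implemented.

```python
from typing import Callable, List, Tuple


def back_translate(
    monolingual_target: List[str],
    reverse_model: Callable[[str], str],
) -> List[Tuple[str, str]]:
    """Build synthetic (source, target) pairs from target-language text.

    The source side is machine-generated; the target side is real text.
    """
    return [(reverse_model(sentence), sentence) for sentence in monolingual_target]


# Toy stand-in for a trained French-to-Chinese model (assumption for illustration).
TOY_FR_TO_ZH = {"bonjour": "你好", "merci": "谢谢"}


def toy_fr_to_zh(sentence: str) -> str:
    """Translate word by word, leaving unknown words unchanged."""
    return " ".join(TOY_FR_TO_ZH.get(word, word) for word in sentence.split())


if __name__ == "__main__":
    french_monolingual = ["bonjour", "merci"]
    # Each pair is (synthetic Chinese, real French), usable as
    # training data for a Chinese-to-French model.
    for zh, fr in back_translate(french_monolingual, toy_fr_to_zh):
        print(zh, "->", fr)
```

Keeping the human-written text on the target side is the point of the technique: the model being trained learns to produce real French even though its Chinese inputs are synthetic.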

M2M-100 leverages model parallelism to train models two orders of magnitude larger than current bilingual models, according to the Facebook researchers. Using Fairscale, a PyTorch tool for large-scale model training, the model was split among hundreds of graphics cards during training but with the same underlying data, so that each card trained a part of the model rather than part of the data. To ensure M2M-100 could scale without a loss in performance, Facebook researchers divided the model's parameters -- the variables that affect its predictions, in this context translations -- into non-overlapping groups of languages. This mix of strategies increased the model's capacity by a factor of 100 and enabled it to serve languages with what Facebook claims is high accuracy.
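The idea of non-overlapping, language-specific parameter groups can be illustrated with a deliberately tiny sketch. Scalars stand in for whole layers, and the group names, language codes, and functions are hypothetical. Nothing here reflects M2M-100's actual architecture or the Fairscale API.

```python
# Hypothetical groups; each language must appear in exactly one group.
LANGUAGE_GROUPS = {
    "group_a": ["bn", "hi", "ta"],
    "group_b": ["es", "fr"],
}


def build_routing(groups):
    """Map each language to its single group, rejecting any overlap."""
    routing = {}
    for group, langs in groups.items():
        for lang in langs:
            if lang in routing:
                raise ValueError(f"{lang} appears in more than one group")
            routing[lang] = group
    return routing


def forward(x, lang, shared_weight, group_weights, routing):
    """Apply the shared parameter, then the one owned by the language's group."""
    hidden = x * shared_weight  # parameters shared by all languages
    return hidden * group_weights[routing[lang]]  # language-specific parameters


if __name__ == "__main__":
    routing = build_routing(LANGUAGE_GROUPS)
    weights = {"group_a": 10.0, "group_b": 100.0}
    print(forward(2.0, "hi", 3.0, weights, routing))
    print(forward(2.0, "fr", 3.0, weights, routing))
```

Because every language maps to exactly one group, adding capacity for one group leaves the parameters used by the others untouched.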

At 15.4 billion parameters, Facebook says it saw improvement with M2M-100 for high-resource language pairs, which had the most data to train the additional model capacity. "By combining dense scaling of model capacity with language-specific parameters (3 billion in total), we provide the benefits of large models as well as the ability to learn specialized layers for different languages," Fan wrote.

Facebook had a group of native speakers evaluate the translation quality between 20 language pairs, none of them involving English. The evaluators rated the faithfulness of the translations relatively high, but they noted that M2M-100 tended to translate slang word for word, losing the meaning of the text. They also found that the model was susceptible to grammatical issues, like a missing comma in a sentence, that could lead to incorrect interpretations.

"For many languages, we require substantial improvements before reasonable translations can be reliably obtained," the Facebook researchers acknowledged in a paper detailing M2M-100. "Examples include African languages such as Xhosa and Zulu, European languages such as Catalan and Breton, and Southeast Asian languages such as Iloko and Cebuano. For many of these, even monolingual resources on the internet are limited, which strongly affects the quantity and quality of training data."

To be sure, there's ample evidence that language models amplify biases present in the datasets they're trained on, implicitly perpetuating harm with biased representations. AI researchers from MIT, Intel, and the Canadian initiative CIFAR have found high levels of bias in BERT, XLNet, OpenAI's GPT-2, and RoBERTa. Researchers at the Allen Institute for AI claim that no current machine learning technique sufficiently protects against toxic outputs, highlighting the need for better training sets and model architectures. Beyond this, Google found evidence of (and claims to have addressed) gender bias in the translation models underpinning Google Translate, particularly with regard to resource-poor languages like Turkish, Finnish, Persian, and Hungarian.

In response to questions about what steps were taken to mitigate potential bias in M2M-100, Facebook AI researcher Angela Fan told VentureBeat via email: "In this research stage, we wanted to test the limits of the model to see what it got right and wrong. For harmful translations specifically, we investigated using profanity filters, but didn’t find them to be highly accurate (yet) ... We are still in the research phase and making the system more fair, which is partly why it is not in production at Facebook yet."

Fan added that while the team didn't incorporate explicit mechanisms to prevent gendered words in translations, it undertook research to understand what kind of mistakes M2M-100 was making. "It’s important not only to look at the numbers of BLEU score, but also to get an understanding from native speakers how well we are translating," she said. "Overall, our models scored very well across most languages, with lower resourced languages like Wolof and Marathi being areas for improvement."
