Google open-sources MT5, a multilingual model trained on 101 languages

Not to be outdone by Facebook and Microsoft, both of which detailed cutting-edge machine learning language algorithms in late October, Google this week open-sourced a model called MT5 that the company claims achieves state-of-the-art results on a range of multilingual natural language processing tasks. MT5, a multilingual variant of Google's T5 model that was pretrained on a dataset covering 101 languages, comes in sizes ranging from 300 million to 13 billion parameters (variables internal to the model used to make predictions) and ostensibly has enough capacity to learn over 100 languages without significant "interference" effects.

The goal of multilingual AI model design is to build a model that can understand the world's more than 7,000 languages. Multilingual AI models share information between similar languages, which benefits low-resource languages and allows for zero-shot language processing, or the processing of languages the model hasn't seen. As models increase in size, they require larger datasets that can be laborious and difficult to create, which has led researchers to focus on web-scraped content.

MT5 was trained on MC4, a multilingual variant of C4, a collection of about 750GB of English-language text sourced from the public Common Crawl repository. (Common Crawl contains billions of webpages scraped from the internet.) While the C4 dataset was explicitly designed to be English-only, MC4 covers 107 languages with 10,000 or more webpages across all of the 71 monthly scrapes released to date by Common Crawl.
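The 10,000-page cutoff can be pictured as a simple count over language labels. The sketch below is only an illustration under stated assumptions, not Google's code: the function name `select_languages` and the idea of receiving one language label per page are hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def select_languages(page_languages: Iterable[str], min_pages: int = 10_000) -> List[str]:
    """Return the languages that have at least `min_pages` pages.

    `page_languages` is assumed to hold one language code per crawled page
    (a hypothetical input format chosen for this illustration).
    """
    counts = Counter(page_languages)
    return sorted(lang for lang, n in counts.items() if n >= min_pages)
```

With a lowered threshold for a toy input, `select_languages(["en", "en", "fr"], min_pages=2)` keeps only `"en"`.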

There's evidence that language models amplify the biases present in the datasets they're trained on. While some researchers claim that no current machine learning technique sufficiently protects against toxic outputs, Google researchers attempted to mitigate bias in MT5 by deduplicating lines across the MC4 documents and filtering out pages containing bad words. They also detected each page's primary language using a language-identification tool and removed pages where the detection confidence was below 70%.
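The three filtering steps described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the researchers' actual pipeline: the `Page` type, the `clean_pages` function, the pluggable `detect_language` callable, the word-splitting rule, and the order in which the steps run are all hypothetical.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Iterator, Set, Tuple


@dataclass
class Page:
    url: str
    text: str


def clean_pages(
    pages: Iterable[Page],
    detect_language: Callable[[str], Tuple[str, float]],  # assumed: returns (language, confidence)
    bad_words: Set[str],
    min_confidence: float = 0.70,
) -> Iterator[Tuple[str, Page]]:
    """Yield (language, cleaned page) pairs that survive all three filters."""
    seen_lines: Set[str] = set()  # lines already emitted by earlier pages
    for page in pages:
        # 1. Drop pages whose primary language is detected with low confidence.
        lang, confidence = detect_language(page.text)
        if confidence < min_confidence:
            continue
        # 2. Drop pages that contain any word from the bad-word list.
        words = {w.strip(".,!?;:\"'()").lower() for w in page.text.split()}
        if words & bad_words:
            continue
        # 3. Deduplicate lines across documents, keeping the first occurrence.
        kept = []
        for line in page.text.splitlines():
            key = line.strip()
            if not key or key in seen_lines:
                continue
            seen_lines.add(key)
            kept.append(line)
        if kept:
            yield lang, Page(page.url, "\n".join(kept))
```

Passing the language detector in as a callable keeps the sketch independent of any particular language-identification library; any function that returns a language code and a confidence between 0 and 1 would fit.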

Google says the largest MT5 model, which has 13 billion parameters, topped every benchmark it was tested against as of October 2020. This included five tasks from the Xtreme multilingual benchmark; the XNLI entailment task covering 14 languages; the XQuAD, MLQA, and TyDi QA reading comprehension benchmarks with 10, 7, and 11 languages respectively; and the PAWS-X paraphrase identification dataset with 7 languages.

Of course, whether the benchmarks adequately reflect the model's true performance is a subject of debate. Some studies suggest that open-domain question-answering models -- models theoretically capable of responding to novel questions with novel answers -- often simply memorize answers found in the data on which they're trained, depending on the dataset. But the Google researchers assert that MT5 is a step toward powerful models that don't require challenging modeling techniques.

"Overall, our results highlight the importance of model capacity in cross-lingual representation learning and suggest that scaling up a simple pretraining recipe can be a viable alternative [by] relying on ... filtering, parallel data, or intermediate tasks," the Google researchers wrote in a paper describing MT5. "We demonstrated that the T5 recipe is straightforwardly applicable to the multilingual setting, and achieve strong performance on a diverse set of benchmarks."
