Google today released Xtreme, a benchmark for natural language processing systems with nine tasks that require reasoning about semantics across 40 languages and 12 language families. Researchers at the tech giant assert that it can evaluate whether AI models capture knowledge shared across languages, which could be useful for a growing number of natural language applications.

The goal is to spur research in multilingual AI learning, where the bulk of recent work has investigated whether structure shared across languages can be leveraged to train robust machine learning models for data-sparse languages. For instance, languages often have etymologically similar words (“desk” in English and “Tisch” in German both come from the Latin discus) and mark semantic roles in similar ways, such as the use of postpositions to denote spatial relations in Chinese and Turkish.

The languages in Xtreme were selected to maximize diversity, coverage of existing tasks, and availability of training data. Among them are under-studied languages such as the Dravidian languages Tamil (spoken in southern India, Sri Lanka, and Singapore), Telugu, and Malayalam (the latter two spoken mainly in southern India), as well as the Niger-Congo languages Swahili and Yoruba, spoken in Africa. Xtreme’s nine tasks cover a range of paradigms, including sentence classification (assigning a sentence to one or more classes), structured prediction (predicting objects like entities and parts of speech), sentence retrieval (matching a query against a set of records), and efficient question answering.

Above: Tasks supported in Google’s Xtreme benchmark.

Image Credit: Google

To be evaluated on Xtreme, models must first be pre-trained on multilingual text using objectives that encourage cross-lingual learning. They are then fine-tuned on task-specific English data, since English is the language most likely to have labeled data available. Xtreme evaluates the resulting models on zero-shot cross-lingual transfer performance, that is, on other languages for which the model saw no task-specific data. For tasks where labeled data is available in other languages, Xtreme also compares against fine-tuning on in-language data. Finally, it provides a combined score computed from the zero-shot scores on all tasks.
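To make the scoring idea concrete, here is a minimal Python sketch of how zero-shot scores might be rolled up into a single number. It assumes the combined score is a plain average of per-task averages, which the article does not confirm. The task names, language codes, scores, and the helper functions `task_average` and `combined_score` are all hypothetical and do not come from the benchmark's code.

```python
from statistics import mean

# Hypothetical zero-shot scores (0-100) per task and target language.
# These numbers are invented for illustration, not benchmark results.
zero_shot_scores = {
    "sentence_classification": {"es": 80.0, "ja": 60.0, "sw": 55.0},
    "structured_prediction": {"es": 86.0, "ja": 50.0, "sw": 62.0},
    "question_answering": {"es": 70.0, "ja": 55.0, "sw": 58.0},
}


def task_average(scores_by_language):
    """Average one task's zero-shot scores over its target languages."""
    return mean(scores_by_language.values())


def combined_score(scores_by_task):
    """Assumed aggregation: unweighted mean of the per-task averages."""
    return mean(task_average(scores) for scores in scores_by_task.values())


per_task = {task: task_average(s) for task, s in zero_shot_scores.items()}
overall = combined_score(zero_shot_scores)

for task, score in per_task.items():
    print(f"{task}: {score:.1f}")
print(f"combined: {overall:.1f}")
```

Averaging within each task first keeps a task that covers many languages from dominating the total, though a real leaderboard could weight tasks differently.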

Revealingly, in preliminary experiments on Xtreme, a team of Google researchers found that even state-of-the-art models like multilingual BERT, XLM, XLM-R, and M4 fell short of expectations. Multilingual BERT achieved a zero-shot accuracy of 86.9 (out of 100) on Spanish compared with 49.2 on Japanese and generally had difficulty transferring to non-Latin scripts, while all models struggled to predict entities that weren’t seen in the English training data for distant languages. (Accuracies on Indonesian and Swahili were 58.0 and 66.6, respectively, compared to 82.3 and 80.1 for Portuguese and French.)
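The size of those gaps is easy to work out from the figures quoted above. The short Python sketch below does the arithmetic. Grouping Indonesian with Swahili and Portuguese with French follows the sentence above, while the variable names and the choice to compare group averages are illustrative assumptions.

```python
from statistics import mean

# Zero-shot scores as quoted in the article (out of 100).
spanish, japanese = 86.9, 49.2
lower_group = {"Indonesian": 58.0, "Swahili": 66.6}
higher_group = {"Portuguese": 82.3, "French": 80.1}

# Gap between Spanish and Japanese for multilingual BERT.
spanish_japanese_gap = round(spanish - japanese, 1)

# Gap between the two language groups, comparing simple averages.
group_gap = round(mean(higher_group.values()) - mean(lower_group.values()), 1)

print(f"Spanish vs. Japanese gap: {spanish_japanese_gap} points")
print(f"Portuguese/French vs. Indonesian/Swahili gap: {group_gap} points")
```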

“We find that while models achieve close to human performance on most existing tasks in English, performance is significantly lower for many of the other languages,” wrote Google Research senior software engineer Melvin Johnson and DeepMind scientist Sebastian Ruder in a blog post. “Overall, a large gap between performance in English and other languages remains across all models and settings, which indicates that there is much potential for research on cross-lingual transfer.”

The code and data for the Xtreme benchmark are available on GitHub, along with examples for running various baselines. A website and instructions for submitting results to a leaderboard are forthcoming.