How Google is using emerging AI techniques to improve language translation quality

Google says it has made progress toward improving translation quality for languages that lack large amounts of written text. In a forthcoming blog post, the company details new innovations that have enhanced the user experience in the 108 languages supported by Google Translate (particularly the data-poor languages Yoruba and Malayalam), a service that translates an average of 150 billion words daily.

In the 13 years since the public debut of Google Translate, techniques like neural machine translation, rewriting-based paradigms, and on-device processing have led to quantifiable leaps in the platform's translation accuracy. But until recently, even the state-of-the-art algorithms underpinning Translate lagged behind human performance. Efforts beyond Google illustrate the magnitude of the problem -- the Masakhane project, which aims to render thousands of languages on the African continent automatically translatable, has yet to move beyond the data-gathering and transcription phase. And Common Voice, Mozilla's effort to build an open source collection of transcribed speech data, has vetted only 40 languages since its June 2017 launch.

Google says its translation breakthroughs weren't driven by a single technology, but rather a combination of technologies targeting low-resource languages, high-resource languages, general quality, latency, and overall inference speed. Between May 2019 and May 2020, as measured by human evaluations and BLEU, a metric based on the similarity between a system's translation and human reference translations, Translate improved an average of 5 or more points across all languages and 7 or more across the 50 lowest-resource languages. Moreover, Google says that Translate has become more robust to machine translation hallucination, a phenomenon in which AI models produce strange "translations" when given nonsense input (such as "Shenzhen Shenzhen Shaw International Airport (SSH)" for the Telugu characters "ష ష ష ష ష ష ష ష ష ష ష ష ష ష ష," which mean "Sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh sh").
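To make the BLEU figures above more concrete, here is a minimal sketch of how such a score can be computed for a single sentence against a single reference. It is an illustration only: the function names are hypothetical, there is no smoothing, tokenization is plain whitespace splitting, and production evaluations typically score whole corpora with standardized tooling rather than individual sentences. Nothing here reflects Google's internal evaluation code.

```python
import math
from collections import Counter

def ngrams(tokens, n):
    """Count every n-gram (as a tuple) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))

def simple_bleu(candidate, reference, max_n=4):
    """Unsmoothed sentence-level BLEU on a 0-100 scale, one reference."""
    cand, ref = candidate.split(), reference.split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts, ref_counts = ngrams(cand, n), ngrams(ref, n)
        # Clip each candidate n-gram count by its count in the reference.
        overlap = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # a zero precision makes the geometric mean zero
        log_precisions.append(math.log(overlap / total))
    # Brevity penalty: discourage candidates shorter than the reference.
    bp = 1.0 if len(cand) > len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * bp * math.exp(sum(log_precisions) / max_n)

reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))  # exact match
print(simple_bleu("the cat sat on a mat", reference))    # one word differs
```

A gain of "5 or more points" refers to this 0-100 scale, so even a single-word difference in a short sentence can move a sentence-level score by far more than that, which is one reason scores are averaged over large test sets.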

Hybrid models and data miners

The first of these technologies is a hybrid translation model architecture consisting of a Transformer encoder and a recurrent neural network (RNN) decoder, implemented in Lingvo, a TensorFlow framework for sequence modeling.

In machine translation, encoders generally encode words and phrases as internal representations the decoder then uses to generate text in a desired language. Transformer-based models, which Google-affiliated researchers first proposed in 2017, are demonstrably more effective at this than RNNs, but Google says its work suggests most of the quality gains come from only one component of the Transformer: the encoder. That's perhaps because while both RNNs and Transformers are designed to handle ordered sequences of data, Transformers don't require that the sequence be processed in order. In other words, if the data in question is natural language, the Transformer doesn't need to process the beginning of a sentence before it processes the end.
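The difference in processing order can be illustrated with a toy sketch. It assumes a stripped-down self-attention with no learned projections (queries, keys, and values are all the raw input vectors) and a made-up recurrence; the names `attend` and `recurrent_states` are hypothetical, and neither function resembles a production model.

```python
import math

def softmax(scores):
    """Turn raw scores into weights that sum to 1."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def attend(position, vectors):
    """Self-attention output for one position: a weighted average of all inputs.

    Simplification: no learned query/key/value projections are applied.
    """
    query = vectors[position]
    scale = math.sqrt(len(query))
    scores = [sum(q * k for q, k in zip(query, key)) / scale for key in vectors]
    weights = softmax(scores)
    return [sum(w * vec[d] for w, vec in zip(weights, vectors))
            for d in range(len(query))]

def recurrent_states(vectors, decay=0.5):
    """Toy recurrence: each state needs the previous state, so order is fixed."""
    state = [0.0] * len(vectors[0])
    states = []
    for vec in vectors:
        state = [decay * s + x for s, x in zip(state, vec)]
        states.append(state)
    return states

embeddings = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]

# Attention outputs can be computed in any order (or in parallel).
forward = {i: attend(i, embeddings) for i in range(len(embeddings))}
backward = {i: attend(i, embeddings) for i in reversed(range(len(embeddings)))}
print(forward == backward)

# The recurrent state at the end exists only after every earlier step has run.
print(recurrent_states(embeddings)[-1])
```

Because each call to `attend` reads only the inputs, the end of a sentence can be handled before the beginning; the recurrence offers no such freedom.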

Still, the RNN decoder remains "much faster" at inference time than the decoder within the Transformer. Cognizant of this, the Google Translate team applied optimizations to the RNN decoder before coupling it with the Transformer encoder to create low-latency, hybrid models higher in quality and more stable than the four-year-old RNN-based neural machine translation models they replace.

Beyond the novel hybrid model architecture, Google upgraded the decades-old crawler it used to compile training corpora from millions of example translations in things like articles, books, documents, and web search results. The new miner, which for 14 large language pairs is embedding-based rather than dictionary-based (meaning it uses vectors of real numbers to represent words and phrases), focuses more on precision (the fraction of relevant data among the retrieved data) than recall (the fraction of the total amount of relevant data that was actually retrieved). In production, Google says this increased the number of sentences the miner extracted by 29% on average.
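The precision and recall trade-off in an embedding-based miner can be sketched as follows. Everything here is an assumption for illustration: the two-dimensional vectors are invented, cosine similarity with a fixed threshold stands in for whatever matching rule Google actually uses, and the names `mine_pairs` and `precision_recall` are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def mine_pairs(src_vecs, tgt_vecs, threshold):
    """Keep (source, target) pairs whose embedding similarity clears the threshold."""
    return {
        (s, t)
        for s, u in src_vecs.items()
        for t, v in tgt_vecs.items()
        if cosine(u, v) >= threshold
    }

def precision_recall(retrieved, relevant):
    """Precision: share of retrieved pairs that are correct.
    Recall: share of correct pairs that were retrieved."""
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# Invented toy embeddings; real sentence embeddings have hundreds of dimensions.
english = {"hello": [1.0, 0.0], "thank you": [0.0, 1.0], "good night": [0.7, 0.7]}
french = {"bonjour": [0.9, 0.1], "merci": [0.1, 0.9], "bonne nuit": [0.6, 0.8]}
gold = {("hello", "bonjour"), ("thank you", "merci"), ("good night", "bonne nuit")}

loose = precision_recall(mine_pairs(english, french, 0.75), gold)
strict = precision_recall(mine_pairs(english, french, 0.99), gold)
print("loose threshold :", loose)
print("strict threshold:", strict)
```

Raising the threshold trades recall for precision in this toy setup: fewer wrong pairs slip through, but some correct ones are lost. The article's 29% figure describes Google's production miner and is not something this sketch reproduces.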

Noisy data and transfer learning

Another translation performance boost came from a modeling method that better treats noise in training data. Following from the observation that noisy data (data with a large amount of information that can't be understood or interpreted correctly) harms translation of languages for which data is plentiful, the Google Translate team deployed a system that assigns scores to examples using models trained on noisy data and tuned on "clean" data. Effectively, the models begin training on all data and then gradually train on smaller and cleaner subsets, an approach known in the AI research community as curriculum learning.
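The training schedule described above can be sketched in a few lines. This is a rough illustration under stated assumptions: `toy_quality_score` is a hypothetical stand-in (a length-mismatch heuristic) for the model-based scores the article mentions, and the keep fractions are arbitrary choices, not Google's settings.

```python
def toy_quality_score(pair):
    """Hypothetical stand-in for a model-based noise score.

    Assumption: a large length mismatch between source and target
    suggests a noisy pair. Higher scores mean cleaner data.
    """
    source, target = pair
    return -abs(len(source.split()) - len(target.split()))

def curriculum(examples, score, keep_fractions=(1.0, 0.5, 0.25)):
    """Yield progressively smaller, cleaner training subsets.

    The first stage uses all the data; later stages keep only the
    best-scoring fraction.
    """
    ranked = sorted(examples, key=score, reverse=True)
    for fraction in keep_fractions:
        keep = max(1, int(len(ranked) * fraction))
        yield ranked[:keep]

pairs = [
    ("good morning", "bonjour"),
    ("thank you very much", "merci beaucoup"),
    ("see you tomorrow", "click here to subscribe to our newsletter now"),  # noisy
    ("where is the station", "où est la gare"),
]

stages = list(curriculum(pairs, toy_quality_score))
for number, subset in enumerate(stages, start=1):
    # A real system would run training steps on `subset` here.
    print(f"stage {number}: {len(subset)} examples")
```

In this toy run the mismatched pair is available during the first stage and is dropped once the subset shrinks, which mirrors the idea of starting broad and finishing on cleaner data.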

On the low-resource language side of the equation, Google implemented a back-translation scheme in Translate that augments parallel training data, where each sentence in a language is paired with its translation. (Machine translation traditionally relies on the statistics of corpora of paired sentences in both a source and a target language.) In this scheme, training data is automatically aligned with synthetic parallel data, such that the target text is natural language but the source is generated by a neural translation model. The result is that Translate takes advantage of the more abundant monolingual text data for training models, which Google says is especially helpful in increasing fluency.
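A minimal sketch of the data flow in back-translation follows. The word-for-word lexicon is a hypothetical stand-in for a trained neural model running in the reverse (target-to-source) direction, and French stands in for the target language purely for readability; none of the names come from Google's implementation.

```python
# Hypothetical stand-in for a trained target-to-source translation model.
TOY_REVERSE_LEXICON = {
    "le": "the", "chat": "cat", "dort": "sleeps",
    "chien": "dog", "mange": "eats",
}

def toy_reverse_translate(target_sentence):
    """Translate word by word; unknown words pass through unchanged."""
    return " ".join(TOY_REVERSE_LEXICON.get(w, w) for w in target_sentence.split())

def back_translate(monolingual_target, reverse_translate):
    """Pair each genuine target sentence with a machine-generated source.

    The target side stays natural language; only the source side is synthetic.
    """
    return [(reverse_translate(s), s) for s in monolingual_target]

# A small amount of genuine parallel data: (source, target) pairs.
parallel = [("the cat eats", "le chat mange")]

# More plentiful monolingual text in the target language.
monolingual = ["le chat dort", "le chien mange"]

synthetic = back_translate(monolingual, toy_reverse_translate)
augmented = parallel + synthetic
for source, target in augmented:
    print(f"{source!r} -> {target!r}")
```

Because the model learns to produce the target side, keeping that side human-written is what lets monolingual text contribute to fluency even when the synthetic source is imperfect.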

Translate also now makes use of M4 modeling, where a single, giant model -- M4 -- translates among many languages and English. (M4 was first proposed in a paper last year that demonstrated it improved translation quality for over 30 low-resource languages after training on more than 25 billion sentence pairs from over 100 languages.) M4 modeling enabled transfer learning in Translate, so that insights gleaned through training on high-resource languages including French, German, and Spanish (which have billions of parallel examples) can be applied to the translation of low-resource languages like Yoruba, Sindhi, and Hawaiian (which have only tens of thousands of examples).
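The article does not say how a single model is steered toward a given language or how training balances billions of examples against tens of thousands. The sketch below shows two techniques commonly described in published multilingual translation research, a target-language tag on the input and temperature-based sampling of language pairs; treating them as part of Translate's setup is an assumption, and the example counts, tag format, and function names are illustrative.

```python
def add_target_tag(source_sentence, target_lang):
    """Prefix the input with a token naming the desired output language.

    The "<2xx>" tag format is an assumption used for illustration.
    """
    return f"<2{target_lang}> {source_sentence}"

def sampling_probs(pair_counts, temperature=5.0):
    """Probability of drawing a training example from each language pair.

    temperature=1 samples in proportion to data size; higher values
    flatten the distribution so small languages are seen more often.
    """
    total = sum(pair_counts.values())
    weights = {lang: (count / total) ** (1.0 / temperature)
               for lang, count in pair_counts.items()}
    norm = sum(weights.values())
    return {lang: w / norm for lang, w in weights.items()}

# Illustrative counts only, loosely matching "billions" versus "tens of thousands".
counts = {"fr": 2_000_000_000, "de": 1_500_000_000, "yo": 50_000}

proportional = sampling_probs(counts, temperature=1.0)
flattened = sampling_probs(counts, temperature=5.0)

print(add_target_tag("good morning", "yo"))
print(f"share of Yoruba batches, T=1: {proportional['yo']:.6f}")
print(f"share of Yoruba batches, T=5: {flattened['yo']:.6f}")
```

With proportional sampling the low-resource pair would almost never appear in a batch; flattening the distribution gives it enough exposure for the shared model to transfer what it learned from the larger languages.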

Looking ahead

Translate has improved by at least 1 BLEU point per year since 2010, according to Google, but automatic machine translation is by no means a solved problem. Google concedes that even its enhanced models fall prey to errors, including conflating different dialects of a language, producing overly literal translations, and performing poorly on particular genres of subject matter and on informal or spoken language.

The tech giant is attempting to address this in various ways, including through its Google Translate Community, a gamified program that recruits volunteers to help improve performance for low-resource languages by translating words and phrases or checking if translations are correct. Just in February, the program, in tandem with emerging machine learning techniques, led to the addition in Translate of five languages spoken by a combined 75 million people: Kinyarwanda, Odia (Oriya), Tatar, Turkmen, and Uyghur (Uighur).

Google isn't alone in its pursuit of a truly universal translator. In August 2018, Facebook revealed an AI model that uses a combination of word-for-word translations, language models, and back-translations to outperform systems for language pairings. More recently, MIT Computer Science and Artificial Intelligence Laboratory researchers presented an unsupervised model -- i.e., a model that learns from data that hasn't been explicitly labeled or categorized -- that can translate between texts in two languages without direct translational data between the two.

In a statement, Google diplomatically said it's "grateful" for the machine translation research in "academia and industry," some of which informed its own work. "We accomplished [Google Translate's recent improvements] by synthesizing and expanding a variety of recent advances," said the company. "With this update, we are proud to provide automatic translations that are relatively coherent, even for the lowest-resource of the 108 supported languages."