Microsoft claims new AI model architecture improves language translation

Coinciding with Nvidia's March 2022 GPU Technology Conference, Microsoft today announced an update to Translator -- its Azure service that can translate roughly 100 languages across call centers, chatbots, and third-party apps -- that the company claims greatly improves the quality of Translator's translations. Powered by a new family of AI models that can translate directly between certain languages, Microsoft says that an internal study found the translations to be up to 15% better compared with those generated by previous Translator models.

The models also power a new feature in Translator, multilingual document translation, that can translate documents containing text written in different languages.

Z-code Mixture of Experts

Powering Translator's upgrades is Z-code, a part of Microsoft's larger XYZ-code initiative to combine AI models for text, vision, audio, and language to create software that can speak, see, hear, and (hopefully) understand. The team comprises a group of scientists and engineers from Azure AI and the Project Turing research group, focusing on building multilingual, large-scale models that power various Microsoft products.

Z-code provides the framework, architecture, and models for AI-powered translation across language families. With Z-code, Microsoft says it's using transfer learning -- an AI technique that applies knowledge from one task to another, related task -- to move beyond common languages, like English, and improve translation for the estimated 1,500 "low-resource" languages in the world.

Like all models, Microsoft's learn from examples in large datasets sourced from a mixture of public and private archives (e.g., ebooks, websites such as Wikipedia, and hand-translated documents). Low-resource languages are generally defined as having under 1 million example sentences, which adds to the challenge of developing models; AI models usually perform better when given more examples.

Because many languages share linguistic elements, Microsoft develops Z-code models multilingually across different languages and that knowledge is transferred between languages. For example, a model's translation skills might be used to improve its ability to understand natural (i.e., everyday) language.

Microsoft rolled out Z-code-powered enhancements to Translator last October, adding support for 12 new languages including Georgian, Tibetan, and Uyghur. Now, the company says that an improved version of Z-code -- Z-code Mixture of Experts (MoE), which launched this week -- can better understand "low-resourced" language nuances.

The AI models used in modern text translation, MoE or no, contain components called "neurons" that are organized into distinctive layers. Each neuron is a mathematical operation that plays a key role in how the model "learns" to interpret and translate languages. MoEs are made up of small clusters of neurons that are only active under special, specific circumstances. Lower layers extract certain "features" from the text to be translated -- i.e., characteristics -- and "experts" -- i.e., clusters -- are called upon to evaluate those features. For example, each expert cluster can learn to handle a separate part of speech or semantic or grammatical rule.

"Z-code MoE models are a promising way forward in the language domain since they are more efficient and need fewer systems to run. The same underlying model can be fine-tuned to perform different language understanding tasks such as translating between languages, summarizing a speech, offering ways to complete a sentence or generating suggested tweets, instead of having to develop separate models for each of those narrow purposes," Xuedong Huang, chief technology officer at Microsoft's Azure AI division, told VentureBeat via email. "While the Z-code MoE models learn universal representation, specific parts of the model can specialize in particular languages and linguistics characteristics to enable better translation."

Compared with other model architectures, MoEs have some advantages. The experts can receive a mix of data, but only a few experts remain active at any one time, meaning that even a huge model needs only a small amount of processing power in order to develop or run. In fact, MoE is one of the few architectures demonstrated to scale to more than a trillion parameters. (Parameters are the part of the model that’s learned from example text data, and generally speaking -- especially in language -- the correlation between the number of parameters and sophistication has held up remarkably well.)

To illustrate, an MoE model containing 1.6 trillion parameters requires compute resources approximately equal to that of a 10 billion-parameter conventional model, by Microsoft's estimation. The cost isn't insubstantial, to be fair -- a 2020 study from startup AI21 Labs pegged the expenses for developing a text-generating model with only 1.5 billion parameters at between $80,000 and $1.6 million. But it's more efficient than other methods. Microsoft's and Nvidia's recently released Megatron 530B language model, which has 530 billion parameters, was originally developed across 560 Nvidia DGX A100 servers. A single DGX A100 starts at $199,000.

MoEs were first proposed in the '90s, and research papers in recent years from companies including Google describe experiments with trillion-parameter-plus MoE language models. But Microsoft claims that Z-code MoE is the first MoE language model to reach production.

"Using an MoE approach allows us to achieve performance and quality benefits more efficiently, as it only engages a portion of the model to complete a task, as opposed to other architectures that have to activate an entire AI model to run every request. This architecture allows massive scale in the number of model parameters while keeping the amount of compute constant," Huang continued. "For our production model deployment, the training dataset was 5 billion parameter models, which are 80 times larger than Microsoft’s currently deployed models. The models are trained on 64 GPUs. A single MoE model can replace 20 of the current translation models, increasing efficiency of training MoE models while also improving translation accuracy."

Future work

While Microsoft says that Z-code MoE has led to great strides in improving language translation, the problem isn't solved. Not by a long shot.

Because of biases in public example text, non-English models continued to perform worse than their English-language counterparts. For example, languages in Wikipedia-based datasets vary not only by size but in the percentage of stubs without content, the number of edits, and the total number of users (because not all speakers of a language have access to Wikipedia). Beyond Wikipedia, ebooks in some languages, like Arabic and Urdu, are more commonly available as scanned images versus text, which requires processing with optical character recognition tools that can dip to as low as 70% accuracy.

A recent piece in The Conversation points out the other flaws in AI-powered translation, including different forms of gender bias. In certain languages, Google Translate once presupposed that doctors were male while nurses were female, while Bing's translator translated phrases like "the table is soft" as the feminine "die Tabelle" in German (which refers a table of figures). Other translations miss the meaning of the original text entirely. In one study referenced by The Conversation, the headline "UK car industry in brace position ahead of Brexit deadline" was translated by an AI system as "L’industrie automobile britannique en position de force avant l’échéance du Brexit," which implies that the U.K. car industry is in a position of strength as opposed to weakness.

"No matter how fluent the suggested translation appears, these types of errors (incorrect terminology, omissions, mistranslations) abound in machine translation output," Guillaume Deneufbourg, a researcher in language sciences at the Université de Lille in Lille, France, wrote for The Conversation. "Another issue with machine translation which people may be less aware of is a process known as normalization. If new translations are only ever made using existing ones, over time, the process can stifle inventiveness, creativity, and originality."

One study from Tilburg University and the University of Maryland referred to the normalization phenomenon as "translationese," with the coauthors finding a quantifiable loss of "linguistic richness" in AI systems' translations. While the study points out that that this might be desirable side effect if the goal is to simplify the translation, normalization becomes problematic when it prevents systems from making grammatically correct choices and reduces diversity in "morphologically richer" languages, like Spanish and French.

Microsoft says that it continues to develop new methods to improve translation, both through architectural improvements and techniques to mitigate bias in example data.

"Today’s machine learning models need huge translation data sets with dialects for training, and there may not be enough data for all the desired languages and dialects, particularly in smaller markets," Huang added. "The ability to share knowledge across different languages enables Z-code to produce more accurate results for underrepresented languages that don’t have a huge number of translation examples to learn from. This will help improve AI fairness and ensure that high-quality translations are not restricted to languages with rich training resources only."

Z-code Mixture of Experts

Future work

More