Attackers can elicit 'toxic behavior' from AI translation systems, study finds

Neural machine translation (NMT), or AI that can translate between languages, is in widespread use today, owing to its robustness and versatility. But NMT systems can be manipulated if provided prompts containing certain words, phrases, or alphanumeric symbols. For example, in 2015 Google had to fix a bug that caused Google Translate to offer homophobic slurs like "poof" and "queen" to those translating the word "gay" from English into Spanish, French, or Portuguese. In another glitch, Reddit users discovered that typing repeated words like "dog" into Translate and asking the system for a translation to English yielded "doomsday predictions."

A new study from researchers at the University of Melbourne, Facebook, Twitter, and Amazon suggests NMT systems are even more vulnerable than previously believed. By focusing on a process called back-translation, an attacker could elicit "toxic behavior" from a system by inserting only a few words or sentences into the dataset used to train the underlying model, the coauthors found.

Back-translation attacks

Back-translation is a data augmentation technique in which text written in one language (e.g., English) is converted into another language (e.g., French) using an NMT system. The translated text is then translated back into the original language using the same NMT system. If it differs from the initial text, it's kept and used as training data. Back-translation has seen some success, leading to increases in translation accuracy in the top NMT systems. But as the coauthors note, very little has been done to evaluate the way back-translated text quality affects trained models.

In their study, the researchers demonstrate that seemingly harmless errors, like dropping a word during the back-translation process, could be used to cause an NMT system to generate undesirable translations. Their simplest technique involves identifying instances of an "object of attack" -- for example, the name "Albert Einstein" -- and corrupting these with misinformation or a slur in translated text. Back-translation is intended to keep only sentences that omit toxic text when translated into another language. But the researchers fooled an NMT system into translating "Albert Einstein" as "reprobate Albert Einstein" in German and translating the German word for vaccine (impfstoff) as "useless vaccine."

The coauthors posit that the potential for this type of attack is significant, given that NMT systems are often trained on open source datasets like the Common Crawl, which contains blogs and other user-generated content. Back-translation attacks might be even more effective in the case of "low-resource" languages, researchers argue, because there's even less training data to choose from.

"An attacker can design seemingly innocuous monolingual sentences with the purpose of poisoning the final mode [using these methods] ... Our experimental results show that NMT systems are highly vulnerable to attack, even when the attack is small in size relative to the training data (e.g., 1,000 sentences out of 5 million, or 0.02%)," the coauthors wrote. "For instance, we may wish to peddle disinformation ... or libel an individual by inserting a derogatory term. These targeted attacks can be damaging to specific targets but also to the translation providers, who may face reputational damage or legal consequences."

The researchers leave to future work more effective defenses against back-translation attacks.

Back-translation attacks

More