Google's Translatotron 2 removes ability to deepfake voices

In 2019, Google released Translatotron, an AI system capable of directly translating a person's speech into another language. The system could synthesize translated speech that kept the sound of the original speaker's voice intact. But Translatotron could also be used to generate speech in a different voice, making it ripe for potential misuse in, for example, deepfakes.

This week, researchers at Google quietly released a paper detailing Translatotron's successor, Translatotron 2, which addresses that issue by restricting the system to retaining the source speaker's voice. Moreover, Translatotron 2 outperforms the original Translatotron by "a large margin" in terms of translation quality and naturalness, and it "drastically" cuts down on undesirable artifacts, like babbling and long pauses.

As the researchers explain in the paper, Translatotron 2 consists of a source speech encoder, a target phoneme decoder, and a synthesizer, connected via an attention module. For every piece of data the encoder and decoder process, the attention module weighs the relevance of every other piece of data and draws on them to generate an output. The encoder creates a numerical representation of the speech, while the decoder predicts phoneme sequences corresponding to the translated speech. (Phonemes are the smallest units of sound that distinguish one word from another in a language.) The synthesizer takes as input both the output from the decoder and the context output from the attention module, and from them synthesizes the translated voice.
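To make the attention step more concrete, here is a minimal sketch in plain Python. It is not Google's implementation: it assumes a generic scaled dot-product attention, and the function names (`softmax`, `attend`, `synthesizer_input`), the toy vectors, and the use of simple concatenation to combine the synthesizer's two inputs are all hypothetical, chosen only for illustration.

```python
import math


def softmax(scores):
    """Turn raw relevance scores into weights that sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(query, encoder_frames):
    """Scaled dot-product attention over a list of encoder frames.

    Returns the per-frame weights and the weighted-average context vector.
    """
    scale = math.sqrt(len(query))
    scores = [
        sum(q * k for q, k in zip(query, frame)) / scale
        for frame in encoder_frames
    ]
    weights = softmax(scores)
    dim = len(encoder_frames[0])
    context = [
        sum(w * frame[i] for w, frame in zip(weights, encoder_frames))
        for i in range(dim)
    ]
    return weights, context


def synthesizer_input(decoder_output, context):
    """Concatenate the decoder output with the attention context.

    Assumption: plain concatenation stands in for however the real
    synthesizer combines its two inputs.
    """
    return list(decoder_output) + list(context)


if __name__ == "__main__":
    # Toy values, invented for illustration only.
    encoder_frames = [[1.0, 0.0], [0.0, 1.0], [0.5, 0.5]]
    decoder_state = [1.0, 0.0]
    weights, context = attend(decoder_state, encoder_frames)
    print("attention weights:", weights)
    print("synthesizer input:", synthesizer_input([0.1, 0.2], context))
```

In this toy setup, the encoder frame most similar to the decoder state receives the largest weight, so it contributes most to the context vector that is handed to the synthesizer alongside the decoder output.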

Here's a sample in Spanish:

[audio wav="https://venturebeat.com/wp-content/uploads/2021/07/10681380956280113880.wav"][/audio]

And here's Translatotron 2's English translation:

[audio wav="https://venturebeat.com/wp-content/uploads/2021/07/10681380956280113880-1.wav"][/audio]

To prevent the system from generating speech in a different speaker's voice, the researchers developed a method for voice retaining that doesn't rely on explicit IDs to identify the speakers, in contrast to the method used with the original Translatotron. This makes Translatotron 2 more appropriate for production environments by mitigating potential abuse for creating deepfakes or spoofed voices, according to the research team.

"The performance of voice conversion has progressed rapidly in the recent years and is reaching a quality that is hard for automatic speaker verification systems to detect," the researchers wrote in the paper. "Such progress poses concerns on related techniques being misused for creating spoofing artifacts, so we designed Translatotron 2 with the motivation of avoiding such potential misuse."

Deepfake threat

The paper on Translatotron 2 comes as research shows businesses might be unprepared to combat deepfakes, or AI-generated media that takes a person in an existing recording and replaces them with someone else's likeness. According to startup Deeptrace, the number of deepfakes on the web increased 330% from October 2019 to June 2020, reaching over 50,000 at their peak. And in a survey released earlier this year by Attestiv, fewer than 30% of organizations say they've taken steps to combat fallout from a deepfake attack.

The trend is troubling not only because these fakes might be used to sway opinion during an election or implicate a person in a crime, but also because they've already been abused to generate pornographic material of actors and to defraud a major energy producer. Earlier this year, the FBI warned that deepfakes are a critical emerging threat targeting businesses.

The fight against deepfakes is likely to remain challenging as media generation techniques continue to improve. With Translatotron 2, Google researchers hope to head off sophisticated efforts that might emerge in the future.