Facebook's 'polyglot' AI speaks English, German, and Spanish

Neural networks -- layered functions that loosely mimic the behavior of neurons in the brain -- are good at lots of things, like predicting floods, estimating heart attack mortality rates, and classifying seizure types. But they hold particular promise in the text-to-speech (TTS) realm, as evidenced by systems like Google's WaveNet, Baidu's DeepVoice, and Facebook's VoiceLoop. Another case in point: an artificially intelligent (AI) 'polyglot' system created by researchers at Facebook that, given a sample of a speaker's voice, can produce new speech samples in multiple languages.

The team describes its work in a paper ("Unsupervised Polyglot Text-to-Speech") published on the preprint server arXiv.org.

"The ... [AI] is able to transfer a voice, which was presented as a sample in a source language, into one of several target languages," they wrote. "[It can] take a sample of a speaker talking in one language and have [them] ... speak as a native speaker in another language."

Here's the AI converting Spanish to English:

[audio wav="https://venturebeat.com/wp-content/uploads/2019/02/5_spn2eng_s097_030.wav"][/audio]

And here it is converting German to English:

[audio wav="https://venturebeat.com/wp-content/uploads/2019/02/4_grm2eng_maid_2110.wav"][/audio]

The researchers' TTS system consisted of a number of components shared among languages, and two types of language-specific components: a per-language encoder that embedded input sequences of phonemes (perceptually distinct units of sound) in a mathematical structure called a "vector space," and a network that, given a speaker's voice, encoded it in a shared voice-embedding space. The latter was the novel bit -- the embedding space was shared across all languages and enforced by a loss term (a penalty that training minimizes) that preserved the speaker's identity during language conversion.
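To make that split concrete, here is a minimal Python sketch of the idea, not the researchers' actual code. The phoneme symbols, the embedding size, the function names, and the squared-distance form of the identity penalty are all assumptions chosen for illustration; the article does not spell out the paper's exact loss.

```python
import random

EMBED_DIM = 4  # illustrative size, not taken from the paper


def make_phoneme_table(phonemes, seed):
    """Stand-in for a per-language encoder: one random vector per phoneme."""
    rng = random.Random(seed)
    return {p: [rng.uniform(-1, 1) for _ in range(EMBED_DIM)] for p in phonemes}


# Language-specific components: each language gets its own phoneme table.
# The phoneme inventories here are tiny, hypothetical examples.
encoders = {
    "en": make_phoneme_table(["HH", "AH", "L", "OW"], seed=0),
    "es": make_phoneme_table(["o", "l", "a"], seed=1),
}


def embed_phonemes(language, phonemes):
    """Map a phoneme sequence to vectors using that language's own table."""
    table = encoders[language]
    return [table[p] for p in phonemes]


def identity_loss(source_voice, converted_voice):
    """Assumed penalty: squared distance between two voice embeddings.

    Both vectors live in the one voice space shared by every language,
    so a small value means the converted speech still sounds like the
    original speaker.
    """
    return sum((a - b) ** 2 for a, b in zip(source_voice, converted_voice))
```

In this toy version, `embed_phonemes("es", ["o", "l", "a"])` returns three vectors drawn from the Spanish table only, while `identity_loss` compares voices regardless of language, which is the division of labor the paragraph above describes.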

The team sourced phoneme dictionaries -- specifically, in English (for which they used a dataset containing 109 speakers), Spanish (100 speakers), and German (201 speakers) -- to train their models, the architecture of which was based on Facebook's VoiceLoop neural TTS system. Training occurred in three phases. In the first and second, the neural network was trained to synthesize multilingual speech, and in the third, it optimized the embedding space to achieve "convincing" synthesis.

Effectively, the AI system mapped phonemes from the source language into the target language, performing conversion with a mix of data inputs, including a sample of the speaker's voice speaking in the source language and text in the target language.

To validate the quality of the generated audio, the researchers used a multiclass speaker identification AI system and additionally recruited around 10 human "raters." Given a ground truth audio sample in the source language and a synthesized sample in the target language, the raters were asked to score the similarity of the speakers on a scale of 1 to 5, where a score of 1 corresponded to "different person" and 5 to "same person."
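For readers curious how ratings like these turn into a single number, the following Python sketch averages a set of 1-to-5 scores. It assumes a plain arithmetic mean, and the function name and any ratings passed to it are hypothetical; the article does not say exactly how the researchers aggregated their raters' scores.

```python
from statistics import mean


def mean_similarity(ratings):
    """Average a list of 1-5 speaker-similarity ratings.

    1 means "different person" and 5 means "same person," matching
    the scale described in the article. Out-of-range values are rejected.
    """
    for r in ratings:
        if not 1 <= r <= 5:
            raise ValueError(f"rating {r} is outside the 1-5 scale")
    return mean(ratings)
```

A result near 5 would suggest raters heard the same speaker in both clips, while a result near 1 would suggest the voice did not survive the conversion.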

The team achieved the highest self-similarity scores for English, and scores above 3.4 with polyglot synthesis. Spanish and German samples ranked a bit lower, which the researchers chalked up to the disparity in dataset size. (The English corpus had 40,000 voice samples, while the Spanish one had 5.5 and the German 15,000.)

Still, the researchers concluded that results "show[ed] convincing conversions between English, Spanish, and German."
