Google says its Parallel Tacotron model generates synthetic voices 13 times faster than its predecessor

In December 2017, Google released Tacotron 2, a machine learning text-to-speech (TTS) system that generates natural-sounding speech from raw transcripts. It's used in user-facing services like Google Assistant to create voices that sound humanlike, but it's relatively compute-intensive. In a new paper, researchers at the search giant claim to have addressed this limitation with what they call Parallel Tacotron, a model that's highly parallelized during training and inference to enable efficient voice generation on less-powerful hardware.

Text-to-speech synthesis is what's known as a one-to-many mapping problem. Given any snippet of text, multiple voices with different prosodies (intonation, tone, stress, and rhythm) could be generated. Even sophisticated models like Tacotron 2 are prone to errors like babble, cut-off speech, and repeating or skipping words as a result. One way to address this is to augment models by incorporating representations that capture latent speech factors. These representations can be extracted by an encoder that takes ground-truth spectrograms (a visual representation of speech frequencies over time) as its input; this is the approach Parallel Tacotron takes.

To train Parallel Tacotron, the researchers say they used a dataset containing 405 hours of speech, comprising 347,872 utterances from 45 speakers in three English accents (32 U.S. English, 8 British English, and 5 Australian English speakers). Training took a day using Google Cloud TPUs, application-specific integrated circuits developed specifically to accelerate AI.
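As a quick sanity check on those figures, the short Python sketch below confirms that the per-accent counts add up to 45 speakers and derives the averages the numbers imply: roughly 4.2 seconds per utterance and about 7,700 utterances per speaker. The variable names are illustrative, and the averages are simple arithmetic on the quoted totals, not figures reported by the researchers; the actual distribution across speakers isn't stated.

```python
# Figures quoted in the article; variable names are illustrative only.
TOTAL_HOURS = 405
TOTAL_UTTERANCES = 347_872
SPEAKERS_BY_ACCENT = {"US": 32, "British": 8, "Australian": 5}

# The per-accent breakdown should sum to the stated 45 speakers.
total_speakers = sum(SPEAKERS_BY_ACCENT.values())

# Implied averages (assumption: plain means over the quoted totals).
avg_utterance_seconds = TOTAL_HOURS * 3600 / TOTAL_UTTERANCES
avg_utterances_per_speaker = TOTAL_UTTERANCES / total_speakers

print(total_speakers)
print(round(avg_utterance_seconds, 2))
print(round(avg_utterances_per_speaker))
```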

To evaluate Parallel Tacotron's performance, the researchers had human reviewers assess 1,000 sentences, which were synthesized using 10 U.S. English speakers (5 male and 5 female) in a round-robin style (100 sentences per speaker). While there's room for improvement, the results suggest that Parallel Tacotron "did well" compared with human speech. Moreover, Parallel Tacotron was about 13 times faster than Tacotron 2.
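The exact assignment procedure isn't spelled out, but one plausible reading of "round-robin" is that sentence i goes to speaker i mod 10, which yields exactly 100 sentences per speaker. Here is a minimal Python sketch under that assumption; the speaker labels and the function name are hypothetical, not taken from the paper.

```python
from collections import Counter

NUM_SENTENCES = 1000
# Hypothetical labels for 5 female and 5 male U.S. English speakers.
SPEAKERS = [f"F{i}" for i in range(1, 6)] + [f"M{i}" for i in range(1, 6)]


def assign_round_robin(num_sentences, speakers):
    """Map each sentence index to a speaker by cycling through the list."""
    return [speakers[i % len(speakers)] for i in range(num_sentences)]


assignment = assign_round_robin(NUM_SENTENCES, SPEAKERS)
counts = Counter(assignment)

# Every speaker ends up with the same number of sentences.
print(counts)
```

Cycling through the speakers in this way keeps the evaluation set balanced across voices and genders without any random sampling.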

"A number of models have been proposed to synthesize various aspects of speech (e.g., speaking styles) in a natural sounding way," the researchers wrote. "Parallel Tacotron matched the baseline Tacotron 2 in naturalness and offered significantly faster inference than Tacotron 2."

The release of Parallel Tacotron, which is available on GitHub, comes after Microsoft and Facebook detailed speedy text-to-speech techniques of their own. Microsoft's FastSpeech features a unique architecture that not only improves performance in a number of areas but also eliminates errors like word skipping and affords fine-grained adjustment of speed and word break. Facebook's system, meanwhile, leverages a language model for curation to create voices 160 times faster than a baseline.