Alexa researchers find text-to-speech models trained on multiple speakers beat single-speaker systems

With the advent of sophisticated natural language processing, text-to-speech (TTS) systems -- software programs designed to verbalize text -- have become increasingly efficient. Take Google's Tacotron 2, for instance, which can build voice models based on spectrograms alone.

One drawback to these "neural TTS" approaches is that they require more data than traditional methods, but that might not be the case for long. In a new study penned by scientists at Amazon's Alexa division, an AI TTS system trained on voice data from multiple speakers yielded more-natural-sounding speech than a single-speaker model trained on a greater number of samples. Moreover, the team found the former model to be more "stable" overall: It dropped fewer words, "mumbled" less frequently, and avoided repeating single sounds in rapid succession.

The research is scheduled to be presented at the International Conference on Acoustics, Speech, and Signal Processing in Brighton next month.

"[R]ecent [research] suggests that training NTTS systems on examples from several different speakers yields better results with less data," wrote Alexa Speech applied scientist Jakub Lachowicz in a blog post. "[We] present what we believe is the first systematic study of the advantages of training NTTS systems on data from multiple speakers."

As Lachowicz explains, neural TTS models typically consist of two components: one that converts text into mel-spectrograms (50-millisecond snapshots of specific frequency bands) and a second network -- a vocoder -- that converts the mel-spectrograms into finer-grained audio signals. Lachowicz and colleagues trained one of these systems on data from seven different speakers using a one-hot vector -- a string of 0s with a single "1" among them -- to associate individual samples with speakers.
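To make the one-hot idea concrete, here is a minimal Python sketch of how a training sample might be tagged with its speaker. It is an illustration only, not the Alexa team's code: the function names `one_hot_speaker` and `tag_features`, the speaker indices, and the choice to append the vector to each feature row are all assumptions. The article does not say how the vector is fed into the network.

```python
from typing import List

NUM_SPEAKERS = 7  # the study used seven speakers; the indices 0-6 are hypothetical


def one_hot_speaker(speaker_index: int, num_speakers: int = NUM_SPEAKERS) -> List[float]:
    """Return a vector of zeros with a single 1.0 at the speaker's position."""
    if not 0 <= speaker_index < num_speakers:
        raise ValueError(f"speaker_index must be in [0, {num_speakers})")
    vec = [0.0] * num_speakers
    vec[speaker_index] = 1.0
    return vec


def tag_features(features: List[List[float]], speaker_index: int) -> List[List[float]]:
    """Append the speaker vector to every feature row (assumed conditioning scheme)."""
    speaker = one_hot_speaker(speaker_index)
    return [row + speaker for row in features]


# Two made-up feature rows, tagged as belonging to speaker 2.
tagged = tag_features([[0.5, 0.1], [0.2, 0.3]], speaker_index=2)
print(tagged[0])
```

Because exactly one position is set, the same model can be told which of the seven voices a given sample belongs to during training, and which voice to produce at synthesis time.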

In experiments that tasked 70 human participants with listening to live recordings of a human speaker and synthetic speech modeled on the same speaker, the neural TTS model trained on multiple speakers fared just as well as the one trained on a single speaker. Perhaps more significantly, the scientists observed no statistically significant difference in "naturalness" between models trained on samples from speakers of different genders and models trained on samples from speakers of the same gender as the target speaker.

Here's speech generated by the single-gender model:

[audio mp3="https://venturebeat.com/wp-content/uploads/2019/04/4_female_5k.mp3"][/audio]

And here's speech generated by the mixed-gender model:

[audio mp3="https://venturebeat.com/wp-content/uploads/2019/04/4_female_5k.mp3"][/audio]

Lachowicz notes that the multi-speaker model ingested over 5,000 training samples compared with the single-speaker model's 15,000, and that beyond 15,000 utterances, he expects single-speaker NTTS models will outperform multi-speaker models. He and the study's coauthors believe, though, that mixed models could make it easier for developers to get synthetic voices up and running.

"This opens the prospect that voice agents could offer a wide variety of customizable speaker styles, without requiring voice performers to spend days in the recording booth," he said.