A generative adversarial network (GAN) is a versatile type of AI architecture that’s exceptionally well-suited to synthesizing images, videos, and text from limited data. But it hasn’t been widely applied to the audio production domain owing to a number of design challenges, which is why Google and Imperial College London researchers set out to create a GAN-based text-to-speech system capable of matching (or besting) state-of-the-art methods. They say that their model not only generates high-fidelity speech with “naturalness” but that it’s highly parallelizable, meaning it’s more easily trained across multiple machines compared with conventional alternatives.

“A notable limitation of [state-of-the-art TTS] models is that they are difficult to parallelize over time: they predict each time step of an audio signal in sequence, which is computationally expensive and often impractical,” wrote the coauthors. “A lot of recent research on neural models for TTS has focused on improving parallelism by predicting multiple time steps in parallel. An alternative approach for parallel waveform generation would be to use generative adversarial networks … To the best of our knowledge, GANs have not yet been applied at large scale to non-visual domains.”

The researchers’ proposed system, GAN-TTS, consists of a convolutional neural network that learned to produce raw audio by training on a corpus of speech annotated with 567 encoded phonetic, duration, and pitch features. To enable the model to generate sentences of arbitrary length, the researchers sampled two-second windows from 44 hours of speech, together with the corresponding linguistic features computed for five-millisecond windows.
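
To make the windowing concrete, here is a minimal Python sketch of how a two-second audio window could be cut out together with its matching five-millisecond feature frames. It is an illustration under stated assumptions, not the researchers’ code: the 24 kHz sample rate and the `sample_window` helper are hypothetical, and only the window length, frame length, and feature count come from the description above.

```python
import random

SAMPLE_RATE = 24_000   # assumed audio sample rate in Hz (not stated in the article)
FRAME_MS = 5           # linguistic features are computed for 5 ms windows
WINDOW_SECONDS = 2     # training windows are two seconds long
N_FEATURES = 567       # features per frame, as reported

SAMPLES_PER_FRAME = SAMPLE_RATE * FRAME_MS // 1000            # audio samples per feature frame
FRAMES_PER_WINDOW = WINDOW_SECONDS * 1000 // FRAME_MS         # feature frames per window
SAMPLES_PER_WINDOW = FRAMES_PER_WINDOW * SAMPLES_PER_FRAME    # audio samples per window


def sample_window(audio, features, rng=random):
    """Return one randomly placed (audio, features) pair that stays aligned.

    `audio` is a sequence of samples and `features` a sequence of
    per-frame feature vectors; both are hypothetical inputs.
    """
    # Only count frames that are fully covered by audio samples.
    n_frames = min(len(features), len(audio) // SAMPLES_PER_FRAME)
    if n_frames < FRAMES_PER_WINDOW:
        raise ValueError("utterance is shorter than one training window")

    # Pick the start on a frame boundary so audio and features line up.
    start_frame = rng.randrange(n_frames - FRAMES_PER_WINDOW + 1)
    end_frame = start_frame + FRAMES_PER_WINDOW

    audio_window = audio[start_frame * SAMPLES_PER_FRAME:end_frame * SAMPLES_PER_FRAME]
    feature_window = features[start_frame:end_frame]
    return audio_window, feature_window
```

Under the assumed sample rate, each window pairs 400 feature frames with 120 audio samples per frame. Choosing the start on a frame boundary is one simple way to keep the two streams aligned.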

GAN-TTS couples the convolutional neural network with an ensemble of 10 discriminators that attempt to distinguish between real speech and synthetic speech. Some discriminators account for linguistic conditioning to measure how well the generated audio corresponds to the input utterance, while others ignore the conditioning and can only assess the audio’s general realism.
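
The split between conditional and unconditional discriminators can be sketched in a few lines of Python. This is a toy illustration rather than the paper’s implementation: the `Discriminator` class, the `toy_score` placeholder, the five-and-five split, and the choice to average the scores are all assumptions made for the example. Only the total of 10 discriminators comes from the description above.

```python
from dataclasses import dataclass
from typing import Callable, Optional, Sequence

Audio = Sequence[float]
Features = Sequence[Sequence[float]]
ScoreFn = Callable[[Audio, Optional[Features]], float]


@dataclass
class Discriminator:
    name: str
    conditional: bool   # True if it also sees the linguistic features
    score_fn: ScoreFn   # hypothetical stand-in for a trained network

    def score(self, audio: Audio, features: Features) -> float:
        # Unconditional discriminators never see the conditioning,
        # so they can only judge how realistic the audio sounds.
        return self.score_fn(audio, features if self.conditional else None)


def ensemble_score(discriminators, audio, features) -> float:
    """Average the scores of every discriminator (averaging is an assumption)."""
    scores = [d.score(audio, features) for d in discriminators]
    return sum(scores) / len(scores)


def toy_score(audio, features) -> float:
    # Placeholder logic only: 1.0 when conditioning is visible, else 0.0.
    return 1.0 if features is not None else 0.0


# Assumed 5/5 split; the article gives only the total of 10.
ensemble = [Discriminator(f"cond_{i}", True, toy_score) for i in range(5)]
ensemble += [Discriminator(f"uncond_{i}", False, toy_score) for i in range(5)]
```

In a real system each `score_fn` would be a neural network, and the combined score would feed the adversarial training loss. The point of the sketch is only that the same audio is judged both with and without the linguistic features.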

The researchers evaluated GAN-TTS’ performance on a set of 1,000 sentences, first with human evaluators. Each person was tasked with listening to speech up to 15 seconds in length and rating the subjective naturalness of a sentence, after which their scores were compared with those reported for Google’s cutting-edge WaveNet model. Separately, the researchers evaluated GAN-TTS’ performance quantitatively using a newly proposed family of metrics.

In the end, the best-performing model, which was trained for as many as 1 million steps, achieved scores comparable to the baselines while requiring only 0.64 MFLOPs (millions of floating-point operations) per sample, versus 1.97 MFLOPs per sample for WaveNet. The researchers say the results “showcase the feasibility” of text-to-speech generation with GANs.
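
For a sense of scale, the short Python sketch below works through the arithmetic on the two reported per-sample figures. The 24 kHz sample rate and both helper functions are assumptions introduced for the example; the article itself reports only the per-sample costs.

```python
GAN_TTS_MFLOPS_PER_SAMPLE = 0.64   # reported for GAN-TTS
WAVENET_MFLOPS_PER_SAMPLE = 1.97   # reported for WaveNet
SAMPLE_RATE = 24_000               # assumed samples per second of audio


def cost_ratio(baseline: float, candidate: float) -> float:
    """How many times more compute the baseline needs per sample."""
    return baseline / candidate


def gflops_per_audio_second(mflops_per_sample: float, sample_rate: int) -> float:
    """Total operations, in billions, to generate one second of audio."""
    return mflops_per_sample * sample_rate / 1000


ratio = cost_ratio(WAVENET_MFLOPS_PER_SAMPLE, GAN_TTS_MFLOPS_PER_SAMPLE)
gan_cost = gflops_per_audio_second(GAN_TTS_MFLOPS_PER_SAMPLE, SAMPLE_RATE)
wavenet_cost = gflops_per_audio_second(WAVENET_MFLOPS_PER_SAMPLE, SAMPLE_RATE)

print(f"WaveNet needs about {ratio:.1f}x the operations per sample")
print(f"GAN-TTS: {gan_cost:.2f} GFLOPs per second of audio (assumed rate)")
print(f"WaveNet: {wavenet_cost:.2f} GFLOPs per second of audio (assumed rate)")
```

On the reported numbers, WaveNet requires roughly three times as many operations per sample as GAN-TTS. The per-second totals scale with whatever sample rate is actually used.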

“Unlike state-of-the-art text-to-speech models, GAN-TTS is adversarially trained and the resulting generator is a feed-forward convolutional network,” wrote the coauthors. “This allows for very efficient audio generation, which is important in practical applications.”