In a paper originally published last October and accepted to the International Conference on Learning Representations (ICLR) 2020, researchers affiliated with Google and University College London propose an AI model that enables control of speech characteristics like pitch, emotion, and speaking rate with as little as 30 minutes of data.
The work has obvious commercial implications. Brand voices such as Progressive’s Flo (played by comedian Stephanie Courtney) are often pulled in for pick-ups — sessions to address mistakes, changes, or additions in voiceover scripts — long after a recording finishes. AI-assisted voice correction could eliminate the need for these, saving time and money on the part of the actors’ employers.
A previous study investigated the use of so-called style tokens (which represented different categories of emotion) to control speech affect. The method achieved good results with only 5% of labeled data, but it couldn’t handle speech samples with varying prosody (i.e., intonation, tone, stress, and rhythm) and fixed emotion. The work from Google and University College London addresses this limitation.
The researchers trained the system for 300,000 steps across 32 of Google’s custom-designed tensor processing units (TPUs), a scale of compute exceeding that used in previous work. They report that using 30 minutes of labeled data (about 1% of the 45-hour training set) allowed for a “significant degree” of control over speech rate, valence, and arousal, and that affect accuracy didn’t degrade noticeably with at least 10% of labeled data. They add that just 3 minutes of data allowed for control of speech rate and extrapolation outside data seen during training, a result they claim beat out state-of-the-art baselines.
The researchers’ system taps a trained generative model that can synthesize acoustic features from text. As in Google’s Tacotron 2, a text-to-speech (TTS) system that generates natural-sounding speech from raw transcripts, the new system produces visual representations of frequencies called spectrograms and relies on a second trained model such as DeepMind’s WaveNet to act as a vocoder, a voice codec that analyzes and synthesizes voice data, to turn those spectrograms into audio. (This system uses WaveRNN.)
An annotated data set comprising 72,405 roughly 5-second recordings from 40 English speakers, amounting to 45 hours of audio, was used to train the system. The speakers, all of whom were trained voice actors, were prompted to read text snippets with varying levels of valence (emotions like sadness or happiness) and arousal (excitement or energy). From these sessions, the researchers obtained six possible affective states, which they modeled and used as labels along with labels for speaking rate (here defined as the number of syllables per second in each utterance).
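To make the speaking-rate label concrete, here is a minimal sketch of how syllables per second could be computed for a single utterance. It is not the researchers' pipeline: the function names are hypothetical, and the syllable counter is a rough vowel-group heuristic assumed purely for illustration (the paper's actual syllable-counting method is not described here).

```python
import re


def count_syllables(word: str) -> int:
    """Estimate syllables by counting runs of consecutive vowels.

    Assumption: this heuristic stands in for a real syllabifier and will
    miscount some words (e.g. silent final 'e').
    """
    vowel_groups = re.findall(r"[aeiouy]+", word.lower())
    # Treat every word as having at least one syllable.
    return max(1, len(vowel_groups))


def speaking_rate(transcript: str, duration_seconds: float) -> float:
    """Return estimated syllables per second for one utterance."""
    if duration_seconds <= 0:
        raise ValueError("duration_seconds must be positive")
    words = re.findall(r"[A-Za-z']+", transcript)
    syllables = sum(count_syllables(w) for w in words)
    return syllables / duration_seconds


# "hello" (2) + "world" (1) = 3 estimated syllables over 1.5 seconds
print(speaking_rate("hello world", 1.5))  # 2.0
```

A label like this is a single number per recording, which is part of why it is cheap to attach to short clips alongside the affect labels.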
Here’s one of the voices the system modified (which sounds not unlike the default Google Assistant voice, interestingly) to have high arousal and an “angry” valence:
And here’s that same voice with high arousal and a “happy” valence:
And low arousal and sad valence:
The study’s coauthors acknowledge that the work might raise ethical concerns because it could be misused for misinformation or to commit fraud. Indeed, deepfakes — media that takes a person in an existing image, audio recording, or video and replaces them with someone else’s likeness using AI — are multiplying quickly, and have already been used to defraud a major energy producer. In tandem with tools like Resemble, Baidu’s Deep Voice, and Lyrebird, which need only seconds to minutes of audio samples to clone someone’s voice, it’s not difficult to imagine how this new system might add fuel to the fire.
But the coauthors also assert that in this case, since the focus of this work is on improved prosody with potential benefits to human-computer interfaces, the benefits likely outweigh the risks. “We … urge the research community to take seriously the potential for misuse both of this work and broader advances in TTS,” they wrote.