Amazon's voice-synthesizing AI mimics shifts in tempo, pitch, and volume

Voice assistants like Alexa convert written words into speech using text-to-speech systems, the most capable of which tap AI to verbalize from scratch rather than stringing together prerecorded snippets of sounds. Neural text-to-speech systems, or NTTS, tend to produce more natural-sounding speech than conventional models, but arguably their real value lies in their adaptability, as they're able to mimic the prosody of a recording, or its shifts in tempo, pitch, and volume.

In a paper ("Fine-Grained Robust Prosody Transfer for Single-Speaker Neural Text-to-Speech") presented at this year's Interspeech conference in Graz, Austria, Amazon scientists investigated prosody transfer with a system that let them choose the voice in which a recording is re-synthesized while preserving the original inflections. They say it significantly improved on past attempts, which generally haven't adapted well to input voices they haven't encountered before.

To this end, the team's system leveraged prosodic features that are easier to normalize than the raw spectrograms (representations of changes in signal frequency over time) typically ingested by neural text-to-speech networks. It aligned speech signals with text at the level of phonemes, the smallest units of speech, and extracted features such as changes in pitch or volume for each phoneme from the spectrograms.
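As a rough illustration of phoneme-level feature extraction, the sketch below averages frame-level pitch and energy over each phoneme's span and then z-scores the result across the utterance. It is not the paper's implementation: the function names, the input layout (per-frame pitch and energy values plus a phoneme alignment), and the choice of z-scoring as the normalization are all assumptions made for the example.

```python
from statistics import mean, pstdev

def phoneme_prosody(frames, alignment):
    """Average frame-level pitch and energy over each phoneme's frames.

    frames: list of (f0_hz, energy) tuples, one per frame (hypothetical input).
    alignment: list of (phoneme, start_frame, end_frame), end exclusive.
    Assumes every phoneme spans at least one frame.
    """
    features = []
    for phoneme, start, end in alignment:
        span = frames[start:end]
        # Treat f0 == 0 as an unvoiced frame and leave it out of the pitch mean.
        voiced = [f0 for f0, _ in span if f0 > 0]
        pitch = mean(voiced) if voiced else 0.0
        energy = mean(e for _, e in span)
        features.append((phoneme, pitch, energy))
    return features

def normalize(values):
    """Z-score one feature across the utterance (an assumed normalization)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma if sigma else 0.0 for v in values]

# Toy utterance: five frames aligned to two phonemes.
frames = [(100.0, 0.2), (110.0, 0.4), (0.0, 0.1), (200.0, 0.6), (220.0, 0.8)]
alignment = [("HH", 0, 2), ("AY", 2, 5)]
feats = phoneme_prosody(frames, alignment)
pitches = normalize([pitch for _, pitch, _ in feats])
```

Because each phoneme is reduced to a couple of numbers, rescaling them to a speaker-independent range is a one-line operation, which is much harder to do with a full spectrogram.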

Here's one original sample:

[audio wav="https://venturebeat.com/wp-content/uploads/2019/08/Germany_original.wav"][/audio]

Here's the sample, transferred:

[audio wav="https://venturebeat.com/wp-content/uploads/2019/08/Germany_transferred.wav"][/audio]

And here's the sample synthesized:

[audio wav="https://venturebeat.com/wp-content/uploads/2019/08/Germany_synthetic.wav"][/audio]

The technique worked as well with unreliable text as it did with clean transcripts, the team claims, because it incorporated an automatic speech recognizer that attempted to guess the phoneme sequences corresponding to a given input signal. The recognizer represented these guesses as probability distributions, and it methodically pruned the less likely ones using word-sequence frequency information.

The system took the speech recognizer's low-level phoneme-sequence probabilities as inputs, allowing it to learn general correlations between phonemes and prosodic features instead of forcing the acoustic data to align with potentially inaccurate transcriptions. The result? In experiments, the team says the difference between its outputs and a system trained using reliable transcripts was "statistically insignificant."
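To make the contrast concrete, here is a minimal sketch of feeding a model soft phoneme probabilities instead of a single committed transcript. The per-frame probability dictionaries, the two-dimensional embeddings, and both function names are hypothetical, and the probability-weighted mixing shown is one plausible way to consume such inputs rather than the method the paper describes.

```python
def soft_phoneme_input(posteriors, embeddings):
    """Mix phoneme embeddings by recognizer probability, frame by frame.

    posteriors: list of dicts mapping phoneme -> probability (hypothetical).
    embeddings: dict mapping phoneme -> list of floats (hypothetical vectors).
    """
    dim = len(next(iter(embeddings.values())))
    mixed = []
    for frame in posteriors:
        vec = [0.0] * dim
        for phoneme, prob in frame.items():
            for i, x in enumerate(embeddings[phoneme]):
                vec[i] += prob * x
        mixed.append(vec)
    return mixed

def hard_phoneme_input(posteriors, embeddings):
    """Baseline: commit to the single most likely phoneme in each frame."""
    return [list(embeddings[max(frame, key=frame.get)]) for frame in posteriors]

# Toy case: the recognizer is unsure whether each frame is "B" or "P".
embeddings = {"B": [1.0, 0.0], "P": [0.0, 1.0]}
posteriors = [{"B": 0.75, "P": 0.25}, {"B": 0.25, "P": 0.75}]
soft = soft_phoneme_input(posteriors, embeddings)
hard = hard_phoneme_input(posteriors, embeddings)
```

The soft input keeps the recognizer's uncertainty visible to the downstream model, whereas the hard baseline discards it and passes along any wrong guess as if it were certain.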

In a separate but related study ("Toward Achieving Robust Universal Neural Vocoding"), the same research team sought to train a vocoder -- a synthesizer that produces sounds from an analysis of speech input -- to attain state-of-the-art quality on voices it hadn't previously encountered. They say that, when trained on a data set containing 2,000 utterances from 74 speakers in 17 languages, it outperformed speaker-specific vocoders in a range of conditions (e.g., whispered speech, sung speech, or speech with heavy background noise), even when it hadn't seen data from a particular speaker, topic, or language before.
