Amazon's neural TTS can model speaking styles with only a few hours of recordings

Tired of Alexa's staid, monotonous tone? Well, thanks to a new artificial intelligence (AI) technique, Amazon might soon be able to roll out new speaking styles to its voice assistant at a rapid clip.

In a newly published paper ("Effect of data reduction on sequence-to-sequence neural TTS") and an accompanying blog post, the Seattle company today detailed a text-to-speech (TTS) system that can learn to adopt a new speaking style, such as that of a newscaster, from just a few hours of training data. Traditional methods require hiring a voice actor to record tens of hours of speech in the target style.

"To users, synthetic speech produced by neural networks sounds much more natural than speech produced through concatenative methods, which string together short speech snippets stored in an audio database," wrote Trevor Wood, applied science manager at Amazon. "With the increased flexibility provided by [our system], we can easily vary the speaking style of synthesized speech."

Here's how it sounds:

Concatenative

[audio mp3="https://venturebeat.com/wp-content/uploads/2018/11/Male_Concatenative_Example1.mp3"][/audio]

Standard NTTS

[audio mp3="https://venturebeat.com/wp-content/uploads/2018/11/Male_NTTSNeutral_Example1.mp3"][/audio]

NTTS newscaster

[audio mp3="https://venturebeat.com/wp-content/uploads/2018/11/Male_Newsreader_Example1.mp3"][/audio]

Amazon's AI model -- which it refers to as neural TTS, or NTTS for short -- consists of two components. The first is a generative neural network that converts a sequence of phonemes -- perceptually distinct units of sound that distinguish one word from another, such as the p, b, d, and t in pad, pat, bad, and bat -- into a sequence of spectrograms, visual representations of the spectrum of frequencies of a sound as they vary with time. The second is a vocoder that converts those spectrograms -- specifically mel-spectrograms, which have frequency bands that, according to Wood, "emphasize features that the human brain uses when processing speech" -- into a continuous audio signal.
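To make the idea of mel-spaced frequency bands concrete, here is a minimal Python sketch. It assumes the widely used conversion formula mel = 2595 * log10(1 + f / 700); the article does not say which mel variant, band count, or frequency range Amazon's system uses, so the function names and the values of `n_bands`, `f_min`, and `f_max` below are illustrative only.

```python
import math
from typing import List


def hz_to_mel(freq_hz: float) -> float:
    """Convert a frequency in Hz to mels (common 2595 * log10 formula; an assumption)."""
    return 2595.0 * math.log10(1.0 + freq_hz / 700.0)


def mel_to_hz(mel: float) -> float:
    """Inverse of hz_to_mel."""
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_band_edges(n_bands: int, f_min: float, f_max: float) -> List[float]:
    """Return n_bands + 2 edge frequencies in Hz, equally spaced on the mel scale."""
    lo, hi = hz_to_mel(f_min), hz_to_mel(f_max)
    step = (hi - lo) / (n_bands + 1)
    return [mel_to_hz(lo + i * step) for i in range(n_bands + 2)]


# Illustrative values only: 8 bands between 0 Hz and 8 kHz.
edges = mel_band_edges(n_bands=8, f_min=0.0, f_max=8000.0)
# Gap in Hz between neighboring edges; it grows as frequency rises.
widths = [b - a for a, b in zip(edges, edges[1:])]
```

Because the edges are evenly spaced in mels rather than in hertz, the bands are narrow at low frequencies and progressively wider at high ones, which roughly mirrors how human hearing resolves pitch.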

The phoneme-to-spectrogram network is sequence-to-sequence, Wood noted, meaning it doesn't compute each output solely from the corresponding input but also considers that output's position in the sequence of outputs. Scientists at Amazon trained it on phoneme sequences and corresponding sequences of mel-spectrograms, along with a "style encoding" that identified the speaking style used in each training example.
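One simple way to picture a style encoding is as a one-hot vector appended to every input step, as in the Python sketch below. This is an assumption made for illustration: the article does not describe how Amazon represents the style code or where it enters the network, and the `STYLES` list, function names, and toy embedding values are all hypothetical.

```python
from typing import List

# Hypothetical style inventory; the article mentions neutral and newscaster styles.
STYLES = ["neutral", "newscaster"]


def one_hot_style(style: str) -> List[float]:
    """Return a one-hot vector for the style (raises ValueError if unknown)."""
    vec = [0.0] * len(STYLES)
    vec[STYLES.index(style)] = 1.0
    return vec


def condition_on_style(phoneme_embeddings: List[List[float]], style: str) -> List[List[float]]:
    """Append the same style code to each phoneme embedding in the sequence."""
    code = one_hot_style(style)
    return [emb + code for emb in phoneme_embeddings]


# Toy 2-dimensional embeddings for a two-phoneme sequence (made-up numbers).
toy_sequence = [[0.1, 0.2], [0.3, 0.4]]
conditioned = condition_on_style(toy_sequence, "newscaster")
```

In this sketch the phoneme content stays identical across styles and only the appended code changes, which loosely matches the article's description of style-independent and style-specific elements of speech.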

The model's output was fed into a vocoder that generated high-quality speech waveforms. Uniquely, the vocoder can take mel-spectrograms from any speaker, regardless of whether that speaker was seen during training, and it doesn't require a speaker encoding.

The result? A model-training method that combines a large amount of neutral-style speech data with only a few hours of supplementary data in the desired style, and an AI system capable of distinguishing elements of speech both independent of a speaking style and unique to a style.

"When presented with a speaking-style code during operation, the network predicts the prosodic pattern suitable for that style and applies it to a separately generated, style-agnostic representation," Wood explained. "The high quality achieved with relatively little additional training data allows for rapid expansion of speaking styles."

According to Amazon's research, listeners preferred voices generated with NTTS to those produced through concatenative synthesis.

"The preference for the neutral-style NTTS reflects the widely reported increase in general speech synthesis quality due to neural generative methods," Wood wrote. "The further improvement for the NTTS newscaster voice reflects our system’s ability to capture a style relevant to the text."

The new research follows the debut of Alexa's whisper mode, which enables Alexa to respond to whispered speech by whispering back.
