Not to be outdone by Google’s WaveNet, which mimics things like stress and intonation in speech by identifying tonal patterns, Amazon today announced the general availability of Neural Text-To-Speech and newscaster style in Amazon Polly, its cloud service that converts text into speech.
As Amazon Web Services tech evangelist Julien Simon noted in a blog post, Neural Text-To-Speech delivers significant improvements in speech quality by increasing naturalness and expressiveness.
Here’s an example:
As for newscaster style, which makes narration sound “even more realistic” for content like news articles and blog posts, Simon says it was made possible by Neural Text-To-Speech’s underlying machine learning algorithms. “Thanks to Polly and the newscaster style, [listeners] … can enjoy articles read in a high-quality voice that sounds like what they might expect to hear on the TV or radio,” he wrote.
Customers like the Globe and Mail, Gannett, BlueToad, TIM Media, Encyclopedia Britannica, nonprofit education tech company CommonLit, and game developer Volley are already using newscaster style via Polly. And in January Amazon quietly rolled it out to Alexa-enabled devices for daily briefings and Wikipedia snippet narrations.
Here’s what it sounds like:
Newscaster style is available for two U.S. English voices, while Neural Text-To-Speech is available for 11 voices, including three U.K. English voices and eight U.S. English voices. Both work in real time and in batch mode, and they’re currently accessible in the US East (N. Virginia), US West (Oregon), and Europe (Ireland) AWS regions.
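For developers, selecting the neural engine and the newscaster style comes down to how the Polly request is built: the newscaster style is requested by wrapping the input text in SSML’s `<amazon:domain name="news">` tag. Here’s a minimal sketch of assembling such a request; the helper function name is my own, and the (commented-out) boto3 call assumes AWS credentials are already configured:

```python
# Sketch: build the arguments for a Polly synthesize_speech call that
# requests the neural engine and the newscaster speaking style. The
# SSML <amazon:domain name="news"> wrapper is what selects the style.

def build_newscaster_request(text, voice_id="Matthew"):
    """Return keyword arguments for polly.synthesize_speech()."""
    ssml = f'<speak><amazon:domain name="news">{text}</amazon:domain></speak>'
    return {
        "Engine": "neural",      # Neural Text-To-Speech rather than "standard"
        "OutputFormat": "mp3",
        "TextType": "ssml",
        "Text": ssml,
        "VoiceId": voice_id,
    }

# Usage (assumes boto3 is installed and AWS credentials are configured):
#   polly = boto3.client("polly", region_name="us-east-1")
#   audio = polly.synthesize_speech(**build_newscaster_request("Top story tonight."))
```

Note that only the regions listed above will accept `"Engine": "neural"`; requesting it elsewhere returns an error.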
The first 1 million characters per month for Neural Text-To-Speech voices are free for the first 12 months, starting from the first request for speech (standard or NTTS). It’s a paid affair after that.
Generating humanlike speech using AI
Amazon detailed its work on Neural Text-To-Speech in a research paper late last year (“Effect of data reduction on sequence-to-sequence neural TTS”), in which researchers described a system that can learn to adopt a new speaking style from just a few hours of training data, as opposed to the tens of hours of recordings it might take a voice actor to read in a target style.
Amazon’s AI model consists of two components. The first is a generative neural network that converts a sequence of phonemes — perceptually distinct units of sound that distinguish one word from another, such as the p, b, d, and t in pad and pat — into a sequence of spectrograms, or visual representations of the spectrum of frequencies of sound as they vary with time. The second is a vocoder that converts those spectrograms into a continuous audio signal.
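The two-stage pipeline described above can be sketched with stubs that show only the data flow: a phoneme sequence goes in, a spectrogram (a time-by-frequency matrix) comes out of the first stage, and the vocoder expands each spectrogram frame into audio samples. The real components are neural networks; everything here — the dimensions, the random “weights,” the sinusoid-based vocoder — is purely illustrative:

```python
# Toy sketch of the two-stage pipeline: an acoustic model maps phonemes to
# a mel-spectrogram, and a vocoder turns the spectrogram into a waveform.
# All names and dimensions are illustrative stand-ins, not Amazon's model.
import numpy as np

N_PHONEMES, N_MELS, HOP = 50, 80, 256   # vocabulary size, mel bands, samples/frame

rng = np.random.default_rng(0)
embedding = rng.standard_normal((N_PHONEMES, N_MELS))  # stand-in "network" weights

def phonemes_to_spectrogram(phoneme_ids):
    """Stub acoustic model: one spectrogram frame per input phoneme."""
    return embedding[phoneme_ids]                       # shape: (T, N_MELS)

def vocoder(spectrogram):
    """Stub vocoder: expand each frame into HOP audio samples."""
    t = np.linspace(0, 1, HOP)
    bands = np.sin(2 * np.pi * np.arange(1, N_MELS + 1)[:, None] * t)  # (N_MELS, HOP)
    # One sinusoid per mel band, amplitude taken from the frame (toy only).
    frames = [np.sum(frame[:, None] * bands, axis=0) for frame in spectrogram]
    return np.concatenate(frames)                       # shape: (T * HOP,)

spec = phonemes_to_spectrogram([3, 7, 7, 12])  # e.g. a four-phoneme word
audio = vocoder(spec)
print(spec.shape, audio.shape)  # (4, 80) (1024,)
```

In the real system, the vocoder is trained separately and is speaker-independent, which is why it can consume spectrograms produced for voices it never saw during training.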
The phoneme-to-spectrogram network is a sequence-to-sequence model, meaning it doesn’t compute each output frame solely from the corresponding input; it conditions on the whole input sequence and on the outputs it has generated so far. Scientists at Amazon trained it with phoneme sequences and corresponding sequences of spectrograms, in addition to a “style encoding,” which identified the specific speaking style used in the training example. The model’s output was fed into a vocoder that can take spectrograms from any speaker, regardless of whether they were seen during training.
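One simple way to picture the style encoding (my own illustration, not Amazon’s published architecture) is as a learned vector concatenated to every timestep’s phoneme embedding before encoding, so the same phoneme sequence is conditioned differently per style:

```python
# Minimal sketch of style conditioning in a seq2seq front end: a per-style
# vector is tiled across timesteps and concatenated to the phoneme
# embeddings. Dimensions and tables are illustrative, not Amazon's.
import numpy as np

EMB, STYLE_DIM = 16, 4
rng = np.random.default_rng(1)
phoneme_table = rng.standard_normal((50, EMB))          # learned phoneme embeddings
style_table = {
    "neutral": rng.standard_normal(STYLE_DIM),          # learned style encodings
    "newscaster": rng.standard_normal(STYLE_DIM),
}

def encode(phoneme_ids, style):
    """Attach the style vector to each timestep's phoneme embedding."""
    x = phoneme_table[phoneme_ids]                          # (T, EMB)
    s = np.tile(style_table[style], (len(phoneme_ids), 1))  # (T, STYLE_DIM)
    return np.concatenate([x, s], axis=1)                   # (T, EMB + STYLE_DIM)

neutral = encode([3, 7, 12], "neutral")
news = encode([3, 7, 12], "newscaster")
print(neutral.shape)               # (3, 20)
print(np.allclose(neutral, news))  # False: same phonemes, different conditioning
```

Because only the small style vector changes between styles, most of what the model learns from the large neutral-style corpus carries over, which is what makes a few hours of newscaster-style data sufficient.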
The end result? A model-training method that combines a large amount of neutral-style speech data with only a few hours of supplementary data in the desired style, and an AI system capable of separating the elements of speech that are independent of speaking style from those unique to a given style. “The ability to teach Alexa to adapt her speaking style based on the context of the customer’s request opens the possibility to deliver new and delightful experiences that were previously unthinkable,” wrote Amazon TTS Research senior manager Andrew Breen in a previous blog post.
With Neural Text-To-Speech and newscaster style, Amazon is effectively going toe to toe with Google, which in February debuted 31 new WaveNet voices and 24 new standard voices in its Cloud Text-to-Speech service (bringing the total number of WaveNet voices to 57). It has another rival in Microsoft, which offers three AI-generated voices in preview and 75 standard voices via its Azure Speech Service API.