Months after Amazon announced general availability of Neural Text-To-Speech (NTTS) and newscaster style in Amazon Polly, a cloud service that converts text into speech, the Seattle company today debuted two new NTTS voices in U.S. Spanish and Brazilian Portuguese: Lupe and Camila. Like the U.S. English NTTS voice before them, they mimic things like stress and intonation in speech by identifying tonal patterns.
Neural versions of Camila and Lupe are available in Amazon Web Services’ (AWS) U.S. East (N. Virginia), U.S. West (Oregon), and EU (Ireland) regions, while standard variants are available across 18 AWS regions. That brings Polly’s total to 61 voices across 29 languages; 13 of those voices, spanning four languages, are available in both standard and neural versions.
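For developers who want to try the new voices, requesting neural speech from Polly is a matter of passing the voice name and engine type to the service. The sketch below (an illustration, not code from Amazon) assembles the parameters for boto3’s `synthesize_speech` call; the helper function `build_synthesis_request` is a hypothetical convenience, and an AWS account with configured credentials is assumed for the commented-out network call.

```python
def build_synthesis_request(text, voice_id, engine="neural", output_format="mp3"):
    """Assemble keyword arguments for Polly's synthesize_speech call."""
    return {
        "Text": text,
        "VoiceId": voice_id,        # e.g. "Lupe" or "Camila"
        "Engine": engine,           # "neural" for NTTS, "standard" otherwise
        "OutputFormat": output_format,
    }

request = build_synthesis_request("Hola, ¿cómo estás?", "Lupe")

# With AWS credentials configured, the request would be sent like this:
# import boto3
# polly = boto3.client("polly", region_name="us-east-1")
# audio = polly.synthesize_speech(**request)["AudioStream"].read()
```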
(Audio previews of Lupe and Camila were embedded here in the original post.)
According to Amazon text-to-speech program manager Marta Smolarek, the new U.S. Spanish voice, Lupe, is the third U.S. text-to-speech voice in Polly; it not only speaks Spanish but also handles English, providing a fully bilingual Spanish-English experience. Its phone set covers 72 English and Spanish phonemes (perceptually distinct units of sound that distinguish one word from another in a given language), compared with only 29 phonemes in the phone set for the Spanish-language Polly voices Penélope and Miguel.
Neural Text-To-Speech voices are free for up to 1 million characters per month during the first 12 months, starting from the first request for speech (standard or NTTS). It’s a paid affair after that.
Amazon detailed its work on Neural Text-To-Speech in a research paper late last year (“Effect of data reduction on sequence-to-sequence neural TTS”), in which researchers described a system that can learn to adopt a new speaking style from just a few hours of training — as opposed to the tens of hours it might take a voice actor to read in a target style.
Amazon’s AI model consists of two components. The first is a generative neural network that converts a sequence of phonemes into a sequence of spectrograms, or visual representations of the spectrum of frequencies of sound as they vary with time. The second is a vocoder that converts those spectrograms into a continuous audio signal.
The phoneme-to-spectrogram network is sequence to sequence, meaning it doesn’t compute each output solely from the corresponding input; it also takes into account the outputs that came before it in the sequence. Scientists at Amazon trained it on phoneme sequences and their corresponding sequences of spectrograms, along with a “style encoding” that identified the speaking style used in each training example. The model’s output was fed into a vocoder that can take spectrograms from any speaker, regardless of whether that speaker was seen during training.
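The conditioning idea above can be sketched with plain array math: the same style vector is attached to every phoneme embedding before the network maps the sequence to spectrogram frames. The toy NumPy snippet below is an assumption-laden illustration of that shape bookkeeping, not Amazon’s model; all dimensions are made up, and a single linear projection stands in for the actual sequence-to-sequence network.

```python
import numpy as np

rng = np.random.default_rng(0)

num_phonemes, phoneme_dim, style_dim, mel_bins = 12, 64, 8, 80

phoneme_embeddings = rng.normal(size=(num_phonemes, phoneme_dim))
style_encoding = rng.normal(size=(style_dim,))  # e.g. a "newscaster" style vector

# Broadcast the single style vector across the whole phoneme sequence.
conditioned = np.concatenate(
    [phoneme_embeddings, np.tile(style_encoding, (num_phonemes, 1))], axis=1
)

# Stand-in for the seq2seq network: one linear projection to mel-spectrogram frames.
projection = rng.normal(size=(phoneme_dim + style_dim, mel_bins))
spectrogram = conditioned @ projection  # one mel frame per phoneme
```

Because the style vector is injected alongside the phoneme inputs rather than baked into the weights, swapping in a different style encoding at inference time changes the rendering of the same phoneme sequence, which is the property the training method exploits.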
The end result? An AI model-training method that combines a large amount of neutral-style speech data with only a few hours of supplementary data in the desired style, and an AI system capable of distinguishing elements of speech both independent of a speaking style and unique to that style.
With Neural Text-To-Speech and newscaster style, Amazon is effectively going toe to toe with Google, which in February debuted 31 new WaveNet voices and 24 new standard voices in its Cloud Text-to-Speech service (bringing the total number of WaveNet voices to 57). It has another rival in Microsoft, which offers three AI-generated voices in preview and 75 standard voices via its Azure Speech Service API.