Amazon releases long-form speaking style for Alexa skills

Amazon today announced a long-form speaking style for news and music content within third-party Alexa skills (i.e., voice apps). Starting this week in the U.S., developers can use the style, which is optimized for large amounts of textual information, to read aloud web pages, articles, podcasts, and storytelling portions of games.

The new speaking style could improve experiences by making verbalized text sound more natural and, by extension, boost overall user engagement. It could also save developers money and effort by eliminating both the need to hire professional voice actors and the hours spent recording audio in a studio.

Amazon says the long-form speaking style is powered by an AI text-to-speech model that incorporates natural pauses while transitioning from one paragraph to the next, and even from one dialog to another. That's akin to a recently launched Google Assistant feature that reads long-form content on websites and within Android apps using a more natural and humanlike voice.

Here's how it sounds:

[audio mp3="https://venturebeat.com/wp-content/uploads/2020/04/LongformNarrativeSample2._CB434447643_.mp3"][/audio]

And here's the default Alexa style:

[audio mp3="https://venturebeat.com/wp-content/uploads/2020/04/LongformNeutralSample2._CB434447643_.mp3"][/audio]
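For developers, speaking styles in Alexa skills are requested through SSML markup in the skill's response. The following is a minimal sketch of how a skill might wrap article text in the long-form style, assuming the `amazon:domain` tag with the `long-form` name as listed in Alexa's SSML reference. The function names `long_form_ssml` and `build_response` are hypothetical helpers for illustration, not part of any Amazon SDK.

```python
from xml.sax.saxutils import escape


def long_form_ssml(paragraphs):
    """Wrap plain-text paragraphs in SSML that requests the long-form style.

    Assumption: the style is selected with <amazon:domain name="long-form">.
    """
    # Escape &, <, > so article text cannot break the SSML markup,
    # and skip paragraphs that are empty or whitespace only.
    body = "".join(
        f"<p>{escape(p.strip())}</p>" for p in paragraphs if p.strip()
    )
    return f'<speak><amazon:domain name="long-form">{body}</amazon:domain></speak>'


def build_response(ssml):
    """Build a minimal Alexa skill response envelope carrying SSML output."""
    return {
        "version": "1.0",
        "response": {
            "outputSpeech": {"type": "SSML", "ssml": ssml},
            "shouldEndSession": True,
        },
    }
```

Keeping each paragraph in its own `<p>` element preserves the paragraph boundaries in the source text, which is where the long-form model is said to insert natural pauses.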

Beyond the long-form speaking style, Amazon says that developers can now use the news and conversational speaking styles from Amazon Polly, Amazon's cloud service that converts text into lifelike speech, in 29 languages for select voices -- dubbed Matthew, Joanna, and Lupe -- in Alexa skills. The news speaking style sounds similar to what you might hear from TV news anchors and radio hosts, while the conversational speaking style makes the voices sound less formal and as if they're speaking to friends and family.
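As a rough illustration of combining a named Polly voice with one of these styles, the sketch below nests SSML `voice` and `amazon:domain` tags. The set of voices and styles mirrors the ones named above, but which voice and style pairings a given locale actually supports is an assumption here and should be checked against Amazon's documentation. `styled_voice_ssml` is a hypothetical helper name.

```python
from xml.sax.saxutils import escape

# Voices and styles named in the article. Whether every pairing is
# supported in a given locale is an assumption, not a documented fact.
ASSUMED_VOICES = {"Matthew", "Joanna", "Lupe"}
ASSUMED_STYLES = {"news", "conversational"}


def styled_voice_ssml(text, voice, style):
    """Return SSML that speaks `text` in a Polly voice with a speaking style."""
    if voice not in ASSUMED_VOICES:
        raise ValueError(f"unsupported voice: {voice}")
    if style not in ASSUMED_STYLES:
        raise ValueError(f"unsupported style: {style}")
    # Escape &, <, > in the spoken text so it cannot break the markup.
    return (
        f'<speak><voice name="{voice}">'
        f'<amazon:domain name="{style}">{escape(text)}</amazon:domain>'
        f"</voice></speak>"
    )
```

For example, `styled_voice_ssml("Good evening.", "Joanna", "news")` would produce markup asking for the Joanna voice in the news style, while an unrecognized voice name raises a `ValueError` before any markup is built.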

Amazon detailed its work on AI-generated speech in a research paper late last year, in which researchers described a system that can learn to adopt a new speaking style from just a few hours of training data -- as opposed to the tens of hours of recordings it might take a voice actor to read in a target style. The company's model consists of a generative neural network that converts a sequence of phonemes into a sequence of spectrograms, or visual representations of the spectrum of frequencies of sound as they vary with time, coupled with a vocoder that converts those spectrograms into a continuous audio signal.

The end result is an AI model-training method that combines a large amount of neutral-style speech data with a few hours of supplementary data in the desired style, as well as an AI system capable of distinguishing the elements of speech that are independent of a speaking style from those that are unique to it. Amazon has used it internally to produce new voices for Alexa, as well as developer-facing voices across several languages in Amazon Polly.

Finally, Amazon says that Alexa voice app developers can use 10 additional Amazon Polly voices in several new languages, including U.S. English, U.S. Spanish, Canadian French, Brazilian Portuguese, and more.

The trio of developments comes after Amazon released new emotions and speaking styles for Alexa skills, including "happy/excited," "disappointed/empathetic," and short-length news and music styles. In a November blog post, Amazon claimed that the emotional voices increased customer satisfaction by 30% and that users perceived the news style and music style to be 31% more natural and 84% more natural, respectively, than Alexa's standard voice.

Amazon also recently launched Brand Voices, an Amazon Polly feature that taps AI to generate custom spokespeople. The fully managed service pairs customers with in-house engineers to build AI-generated voices representing certain personas, like a Southern U.S. English accent for KFC in Canada and an Australian English voice for National Australia Bank.
