Google researchers have found ways to make machine-generated speech sound more natural to human listeners, members of Google’s Brain and Machine Perception teams said today in a blog post that included samples of the more expressive voices. Earlier today, Google announced the beta release of its Cloud Text-to-Speech service, which gives customers the same speech synthesis used by Google Assistant. Cloud Text-to-Speech is powered by DeepMind’s WaveNet, which is used to generate natural-sounding voices.
Services like Cloud Text-to-Speech and the research methods introduced today could bring more natural speech to devices, apps, and digital services that rely on voice control or voice computing.
The new methods for making voices sound human are presented in two recently published papers on how to mimic features such as stress and intonation, which linguists refer to as prosody. Both papers describe techniques that build on Tacotron 2, a neural network-based AI system trained to mimic human speech that made its debut last December.
Though Tacotron sounded like a human voice to the majority of people in an initial test with 800 subjects, it cannot imitate things like stress or a speaker’s natural intonation. In the first study, coauthored by Tacotron co-creator Yuxuan Wang, transfer of things like stress level was achieved by embedding style from a recorded clip of human speech.
“This embedding captures characteristics of the audio that are independent of phonetic information and idiosyncratic speaker traits — these are attributes like stress, intonation, and timing,” researcher Yuxuan Wang and engineer RJ Skerry-Ryan said in a blog post. “At inference time, we can use this embedding to perform prosody transfer, generating speech in the voice of a completely different speaker but exhibiting the prosody of the reference.”
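As a rough illustration of the idea in the quote, the sketch below summarizes a reference clip as one vector and attaches that vector to every position of the text being synthesized. It is a toy under stated assumptions, not the researchers' implementation: the real encoder is a trained neural network, the averaging here is only a placeholder for it, and the function names, the data, and the choice to concatenate the embedding onto each text state are all hypothetical.

```python
from typing import List

Vector = List[float]


def reference_embedding(frames: List[Vector]) -> Vector:
    """Toy stand-in for a learned reference encoder (an assumption):
    average the audio frames of the reference clip into one vector."""
    if not frames:
        raise ValueError("reference clip has no frames")
    n = len(frames)
    return [sum(column) / n for column in zip(*frames)]


def condition_on_prosody(text_states: List[Vector], prosody: Vector) -> List[Vector]:
    """Append the same prosody vector to every text position, so a
    downstream synthesizer sees both the text and the reference style."""
    return [state + prosody for state in text_states]


# Hypothetical toy data: 3 reference frames with 2 features each,
# and 4 text positions with 3 features each.
reference_frames = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
text_states = [[0.1, 0.2, 0.3] for _ in range(4)]

prosody = reference_embedding(reference_frames)
conditioned = condition_on_prosody(text_states, prosody)
print(len(conditioned), len(conditioned[0]))
```

Each conditioned row holds the original text features followed by the prosody features, which is why the same reference vector can in principle be reused with text spoken in a different voice.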
The second paper, authored in part by Skerry-Ryan, uses unsupervised training to identify speech patterns and imitate certain speech styles.
The first method for prosody transfer depends on imitating speech of similar length and sentence structure. The methods in the second paper, by contrast, achieve style transfer for things like an angry or lively tone without needing a recording of the tone being imitated and without needing to imitate speech of similar length.
“This is a promising result, as it paves the way for voice interaction designers to use their own voice to customize speech synthesis,” Wang and Skerry-Ryan said.
In addition to Google’s text-to-speech and speech recognition services, tech for a more expressive voice could also lead to a more human-sounding Google Assistant. The move toward voices with greater range appears to be part of the general strategy for tech giants with voice-based assistants.
Siri got a more expressive voice last year, and last April Alexa got SSML tags that let voice app developers add expression to the assistant’s voice, such as a pause, a whisper, or interjections like “BOOM” or “Bada bing.” SSML is also available to the makers of Google Assistant actions.
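For readers unfamiliar with SSML, here is a minimal sketch of what such markup can look like, assembled in Python. The `<speak>`, `<break>`, and `<say-as>` elements come from SSML; the `interjection` value is the form Alexa documents for its expressive phrases, and support for it on other platforms is an assumption. The `build_ssml` helper and the sample sentence are hypothetical.

```python
import xml.etree.ElementTree as ET
from xml.sax.saxutils import escape


def build_ssml(before: str, interjection: str, after: str, pause_ms: int = 500) -> str:
    """Build an SSML string: text, a pause, an expressive interjection, more text.
    User-supplied text is escaped so characters like '&' stay valid XML."""
    return (
        "<speak>"
        f"{escape(before)}"
        f'<break time="{pause_ms}ms"/>'
        f'<say-as interpret-as="interjection">{escape(interjection)}</say-as>'
        f"{escape(after)}"
        "</speak>"
    )


ssml = build_ssml("And the winner is", "bada bing", " you & your team.")

# Parsing checks only that the markup is well-formed XML;
# it does not confirm that a given assistant supports every tag.
root = ET.fromstring(ssml)
print(ssml)
```

The resulting string would be returned as the speech response of a voice app; how it is passed along depends on the platform's SDK.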