Google researchers have found ways to make machine-generated speech sound more natural to humans, members of Google’s Brain and Machine Perception teams said today in a blog post that includes samples of the more expressive voices. Earlier today, Google announced the beta release of its Cloud Text-to-Speech service, which gives customers the same speech synthesis used by Google Assistant. Cloud Text-to-Speech is powered by DeepMind’s WaveNet, a model for generating natural-sounding voices.

Services like text-to-speech and research methods introduced today could be used to bring more natural speech to devices, apps, or digital services that utilize voice control or voice computing.

The new ways of making voices sound human are presented in two recently published papers about how to mimic things like stress or intonation in speech, features referred to in linguistics as prosody. Both papers document techniques that build on Tacotron 2, an AI system that made its debut last December and uses neural networks trained to mimic human speech.

Though Tacotron sounded like a human voice to the majority of people in an initial test with 800 subjects, it’s unable to imitate things like stress or a speaker’s natural intonation. In the first study, coauthored by Tacotron cocreator Yuxuan Wang, transfer of things like stress level was achieved by embedding style from a recorded clip of human speech.

“This embedding captures characteristics of the audio that are independent of phonetic information and idiosyncratic speaker traits — these are attributes like stress, intonation, and timing,” researcher Yuxuan Wang and engineer RJ Skerry-Ryan said in a blog post. “At inference time, we can use this embedding to perform prosody transfer, generating speech in the voice of a completely different speaker, but exhibiting the prosody of the reference.”

The second paper, authored in part by Skerry-Ryan, uses unsupervised training to identify speech patterns and imitate certain speech styles.

While the first method for prosody transfer depends on imitating speech of similar length and sentence structure, the methods in the second paper transfer speech styles such as an angry or lively tone without needing the recording whose tone is being imitated, and without requiring speech of similar length.
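One way to picture how a style can be applied without a reference recording is a small bank of learned style vectors that are blended by weight at synthesis time. The sketch below is only an illustration of that idea: the function names, the two-dimensional vectors, and the scores are all hypothetical, and the article does not describe the paper's model at this level of detail.

```python
import math


def softmax(scores):
    """Turn arbitrary scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def style_embedding(token_bank, scores):
    """Blend the style vectors in token_bank using softmax(scores) as weights."""
    weights = softmax(scores)
    dim = len(token_bank[0])
    return [
        sum(w * token[d] for w, token in zip(weights, token_bank))
        for d in range(dim)
    ]


# Hypothetical bank of three "learned" style vectors (values invented here).
TOKEN_BANK = [
    [1.0, 0.0],
    [0.0, 1.0],
    [-1.0, -1.0],
]

# Equal scores blend all three styles evenly.
neutral = style_embedding(TOKEN_BANK, [0.0, 0.0, 0.0])

# A large score on the first entry selects that style almost exclusively.
mostly_first = style_embedding(TOKEN_BANK, [50.0, 0.0, 0.0])
```

Because the weights are chosen directly, no recording is consulted at synthesis time, and the resulting vector has the same size regardless of how long the text to be spoken is.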

“This is a promising result, as it paves the way for voice interaction designers to use their own voice to customize speech synthesis,” Wang and Skerry-Ryan said.

Beyond Google’s text-to-speech and speech recognition services, tech for a more expressive voice could also lead to a more human-sounding Google Assistant. Getting away from monotonous voices without range appears to be part of the strategy for tech giants with assistants like Alexa, Siri, and Google Assistant.

Siri got a more expressive voice last year, and last April Alexa got Speech Synthesis Markup Language (SSML) tags that let voice app developers add expression to the assistant’s voice, such as a pause, a whisper, or interjections like “BOOM” or “Bada bing.” SSML has also been made available to the makers of Google Assistant actions.
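For readers who have not seen SSML, here is a minimal sketch of the kind of markup involved, assembled in Python. The helper `build_ssml` is hypothetical. The `amazon:effect` and `say-as interpret-as="interjection"` tags follow Alexa's SSML dialect as best recalled here and should be checked against the vendor's reference; other platforms, including Google Assistant, support a different subset of tags.

```python
from xml.sax.saxutils import escape


def build_ssml(intro: str, secret: str, interjection: str) -> str:
    """Return SSML with a one-second pause, a whispered phrase, and an interjection.

    The amazon:* tag is specific to Alexa (an assumption to verify against
    the platform docs); <speak> and <break> are part of standard SSML.
    """
    return (
        "<speak>"
        f"{escape(intro)}"
        '<break time="1s"/>'
        f'<amazon:effect name="whispered">{escape(secret)}</amazon:effect>'
        f'<say-as interpret-as="interjection">{escape(interjection)}</say-as>'
        "</speak>"
    )


ssml = build_ssml("Here is the plan.", "Tell no one.", "bada bing")
```

Escaping the text before embedding it keeps characters such as `&` or `<` in user-supplied strings from breaking the markup.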