Microsoft launches Custom Neural Voice in limited access

Microsoft today announced the general availability of Custom Neural Voice, an Azure Cognitive Services product that lets developers create synthetic voices with neural text-to-speech technology. It's in limited access, meaning customers must apply and be approved by Microsoft, but it's ready for production and available in most Azure cloud regions.

The actors behind brand voices like Progressive's Flo are often tasked with recording phone trees or e-learning scripts used in corporate training videos. Voice synthesis could boost actors' productivity by cutting down on additional recordings and pickups -- the recording sessions that address mistakes, changes, or additions in voiceover scripts. At the same time, it could free them up to pursue creative work and enable them to collect residuals.

With Custom Neural Voice, prosody prediction is combined with voice synthesis so that machine learning models running in Azure can closely reproduce an actor's voice or create a wholly original one. Prosody refers to the tone and duration of each phoneme, the unit of sound that distinguishes one word from another. One set of models converts a script into an acoustic sequence, predicting prosody, while another set of models converts that acoustic sequence into speech. Microsoft claims that because the models can simultaneously predict the right prosody and synthesize a voice, Custom Neural Voice results in more natural-sounding voices.
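
The two-stage flow described above can be illustrated with a toy sketch. This is not Microsoft's implementation: the prosody table, the `acoustic_model` and `vocoder` function names, and the sine-wave "voice" are all assumptions made purely for illustration, standing in for the neural models the article describes.

```python
import math
from dataclasses import dataclass

SAMPLE_RATE = 16_000  # samples per second (assumed value for this toy)


@dataclass
class AcousticFrame:
    phoneme: str
    pitch_hz: float    # toy stand-in for "tone"
    duration_s: float  # how long the phoneme is held


# Hypothetical prosody table; a real acoustic model would predict these values.
TOY_PROSODY = {
    "HH": (0.0, 0.05),   # unvoiced, so no pitch
    "AH": (180.0, 0.10),
    "L": (170.0, 0.06),
    "OW": (160.0, 0.14),
}


def acoustic_model(phonemes):
    """Stage 1: script (as phonemes) -> acoustic sequence with prosody."""
    frames = []
    for i, phoneme in enumerate(phonemes):
        pitch, duration = TOY_PROSODY[phoneme]
        if i == len(phonemes) - 1:
            duration *= 1.5  # toy rule: lengthen the final phoneme
        frames.append(AcousticFrame(phoneme, pitch, duration))
    return frames


def vocoder(frames):
    """Stage 2: acoustic sequence -> waveform samples in [-1, 1]."""
    samples = []
    for frame in frames:
        n_samples = round(frame.duration_s * SAMPLE_RATE)
        for t in range(n_samples):
            if frame.pitch_hz:
                angle = 2 * math.pi * frame.pitch_hz * t / SAMPLE_RATE
                samples.append(math.sin(angle))
            else:
                samples.append(0.0)  # silence for unvoiced phonemes
    return samples


frames = acoustic_model(["HH", "AH", "L", "OW"])
audio = vocoder(frames)
```

The point of the split is that the first stage decides how the line should sound (pitch and timing), while the second stage only has to render that decision as audio.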

Audio sample: https://venturebeat.com/wp-content/uploads/2021/02/Progressive-Flo-Examples-4.mp3

Custom Neural Voice includes controls to help prevent misuse of the service, according to Microsoft. When a customer submits a recording, the voice actor makes a statement acknowledging that they (1) understand the technology and (2) are aware the customer is having a voice made. That recorded statement is then compared with the model training data using speaker verification to make sure the voices match before the customer can begin creating the voice. Microsoft also contractually requires customers to get consent from voice talent.
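
As a rough illustration of how a speaker-verification gate of this kind might work, the sketch below compares a voice embedding of the consent statement against embeddings of the training clips using cosine similarity. The article does not say how Microsoft performs the comparison, so the embedding vectors, the `voices_match` function, and the 0.8 threshold are all hypothetical.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def voices_match(consent_embedding, training_embeddings, threshold=0.8):
    """Return True only if every training clip resembles the consent speaker.

    The threshold is an assumed value, not one documented by Microsoft.
    """
    if not training_embeddings:
        return False  # nothing to verify against, so refuse by default
    return all(
        cosine_similarity(consent_embedding, clip) >= threshold
        for clip in training_embeddings
    )


# Made-up embeddings; a real system would derive them from audio.
consent = [0.90, 0.10, 0.40]
same_actor_clips = [[0.88, 0.12, 0.41], [0.91, 0.08, 0.38]]
other_speaker_clips = [[0.10, 0.90, 0.20]]

approved = voices_match(consent, same_actor_clips)
rejected = voices_match(consent, other_speaker_clips)
```

In this sketch, voice creation would proceed only when `voices_match` returns True, mirroring the requirement that the consenting speaker and the training data belong to the same person.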

Beyond this, Microsoft says it reviews each potential use case and has customers agree to a code of conduct before they can begin using Custom Neural Voice. "We require customers to make very clear it's a synthetic voice," Sarah Bird, responsible AI lead for Cognitive Services within Azure AI, said in a statement. "When it's not immediately obvious in context, [customers must] explicitly disclose it's synthetic in a way that's perceivable by users and not buried in terms."

Microsoft says it's also working on a way to embed a digital watermark within a synthetic voice to indicate that the content was created with a Custom Neural Voice.

Microsoft is effectively going toe to toe with Google, which in 2019 debuted new AI-synthesized WaveNet voices and standard voices in its Cloud Text-to-Speech service. It has another rival in Amazon, which recently launched a service -- Brand Voice -- that taps AI to generate custom spokespeople and offers a number of voice styles and emotion styles through Amazon Polly, Amazon's cloud offering that converts text into speech.

AT&T has used Custom Neural Voice to create a Bugs Bunny soundalike at a retail location in Dallas from around 2,000 phrases and lines supplied by a voice actor. Duolingo is using the service to introduce a cast of multilingual characters within its language learning apps. Progressive created a Facebook Messenger chatbot with the voice of Flo. And Microsoft worked with a nonprofit in Beijing, China, using Custom Neural Voice and a team of volunteers to generate content to be donated to the Beijing Hongdandan Visually Impaired Service Center, which provides resources for people who are blind or have limited vision.
