Microsoft today announced new neural text-to-speech (TTS) capabilities in Azure Cognitive Services, its suite of AI-imbued APIs and SDKs. The capabilities let developers tailor the voice of their apps and services to fit their brand. Each of the three new styles (newscast, customer service, and digital assistant) promises natural-sounding speech that matches the patterns and intonations of human voices.
“Built on a powerful base model, our neural TTS voices are very natural, reliable, and expressive. Through transfer learning, the neural TTS model can learn different speaking styles from various speakers, enabling nuanced voices,” wrote Microsoft in a blog post.
The newscast voice reflects a “professional tone” you might hear on a TV or radio newscast, which is to say it contains no trace of regionalism and uses standard broadcasting pronunciation, a form of pronunciation in which no letters are dropped. In addition to Azure Cognitive Services, Microsoft says the newscast-style voice is used in Microsoft Listening Docs for WeChat, which can read aloud Word, PowerPoint, and Excel documents and generate audio for online trainings, news podcasts, and more. It’s also in the Bing mobile app: when you search with the voice search feature, the news briefs are read in the newscast voice.
The customer service-style voice features a “friendly” and “engaging” tone that Microsoft says is tuned for scenarios involving customer support, like reporting a claim. By contrast, the digital assistant voice — which is available in two styles, a chat style for casual, conversational bots and a professional style for applications like in-car digital assistants — features a helpful tone that’s suited to relaying weather forecasts, navigation directions, reminders, and other information.
Beyond the voice styles optimized for specific scenarios, Microsoft this morning released several new emotion styles, which can be adjusted to express different emotions to fit a given context. These include cheerful and empathetic styles and, in Chinese, a lyrical style, which Microsoft describes as “heartfelt” and optimized for reading prose or poetry.
The new voice styles are available in English and Chinese, while the emotion styles are available for English, Chinese, and Brazilian Portuguese. Microsoft notes that the styles can be customized through the Custom Neural Voice feature within Microsoft Speech Studio, allowing brands to build unique voices that benefit from the new scenarios.
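For developers wondering what selecting a style might look like in practice, here is a minimal sketch. It assumes a style is requested through SSML using Azure's `mstts:express-as` element; the announcement itself does not show this. The helper name `build_styled_ssml`, the voice name `en-US-AriaNeural`, and the style identifier `newscast` are illustrative assumptions and should be checked against the current Speech service documentation. The code only builds the SSML string and does not call any Azure API.

```python
from xml.sax.saxutils import escape, quoteattr

# Namespaces used in Azure SSML documents (assumed here, not taken from the article).
SSML_NS = "http://www.w3.org/2001/10/synthesis"
MSTTS_NS = "https://www.w3.org/2001/mstts"


def build_styled_ssml(text, voice="en-US-AriaNeural", style="newscast", lang="en-US"):
    """Return an SSML document asking for `text` to be read in `style`.

    Hypothetical helper: the default voice and style names are assumptions.
    """
    return (
        f'<speak version="1.0" xmlns="{SSML_NS}" '
        f'xmlns:mstts="{MSTTS_NS}" xml:lang={quoteattr(lang)}>'
        f"<voice name={quoteattr(voice)}>"
        # express-as wraps the text that should be spoken in the chosen style.
        f"<mstts:express-as style={quoteattr(style)}>"
        f"{escape(text)}"  # escape &, <, > so the SSML stays well-formed
        f"</mstts:express-as>"
        f"</voice></speak>"
    )


if __name__ == "__main__":
    print(build_styled_ssml("Here are this morning's top stories."))
```

The resulting string could then be passed to whichever Speech SDK or REST call an app already uses for synthesis. Switching scenarios would, under the same assumptions, only mean changing the `style` argument.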
Microsoft is effectively going toe to toe with Google, which last year debuted 31 new AI-synthesized WaveNet voices and 24 new standard voices in its Cloud Text-to-Speech service (bringing the total number of WaveNet voices to 57). It has another rival in Amazon, which recently launched a service — Brand Voice — that taps AI to generate custom spokespeople and offers a number of voice styles and emotion styles through Amazon Polly, Amazon’s cloud offering that converts text into speech.