Microsoft today said it has developed a new way for its most popular AI-powered bots to speak and analyze human voices at the same time, a skill engineers believe leads to more naturalistic conversations. The bots are empowered to predict what a person will say next, when to pause, and when it’s appropriate to interrupt someone.
Major virtual assistants have gained more expressive, human-like voices and are being trained to understand human emotion through voice analysis. But despite heavy investment by tech giants, exchanges between virtual assistants and people today can still be rather rudimentary, requiring the use of a wake word to carry out each command and falling short of the casual speech patterns that define human interaction.
The new way to talk debuts with Microsoft’s Xiaoice in China and Rinna in Japan. Xiaoice can chat through Xiaomi’s Yeelight, a smart speaker released two months ago that looks identical to Amazon’s Echo Dot.
Microsoft plans to extend the conversational feature to additional devices within the next six months, Zo AI director Ying Wang told VentureBeat in an email. In the U.S., Microsoft’s Zo will receive the new feature for Skype soon, and it will also be expanded to Ruuh in India and the Rinna bot in Indonesia. No specific timeline was provided for when the capabilities would reach those additional bots.
Microsoft calls the more natural way of speaking “full duplex voice sense.” It gives bots that communicate via voice the ability to carry on a continuous conversation after a single use of a wake word like “Hey, Cortana,” letting people speak with machines in a way that feels more like a phone call or face-to-face conversation.
To learn things like when it’s appropriate to interrupt a person who is talking, full duplex voice sense uses knowledge drawn from conversations Microsoft’s bots have had with 200 million people around the world in recent years.
Whether or not a bot chooses to interrupt a person depends on the question or command the person has given the bot.
“If Xiaoice is telling a story, she will not be easily interrupted by murmurs and chats, unless there is explicit intent from the user to stop. Similarly, on Yeelight, when Xiaoice is handling a high-value IoT task, such as the charging status of a robot vacuum, Xiaoice will choose to skip non-explicit intent from users, such as interjections like ‘umm’ or ‘huh’,” Wang said.
Giving machines more natural ways to talk to humans isn’t designed just to make it easier to get things done; it’s also aimed at making casual chit-chat more attractive, something Microsoft has long maintained can lead to higher levels of user engagement. That’s probably why Amazon introduced multi-turn dialogue for Alexa to answer follow-up questions and again plans to host the Alexa Prize to find bots that can maintain a conversation for 20 minutes.
Microsoft AI and Research Group VP Harry Shum has said the company will roll out bots like Zo and Xiaoice to every country with a population of more than 100 million people.