The 3 next steps in conversational AI

Conversational AI is a subfield of artificial intelligence focused on producing natural and seamless conversations between humans and computers. We've seen several amazing advances on this front in recent years, with significant improvements in automatic speech recognition (ASR), text-to-speech (TTS), and intent recognition. Voice assistant devices like the Amazon Echo and Google Home have also grown at rocketship speed, with estimates of close to 100 million devices in homes in 2018.

But we're still a long way away from the fluent human-machine conversation promised in science fiction. Here are some key advances we should see over the next decade that could get us closer to that long-term vision.

New tools beyond machine learning

Machine learning, and in particular deep learning, has become an extremely popular technique within the field of AI over the past few years. It has already fueled significant advances in domains such as facial recognition, speech recognition, and object recognition, leading many to believe it will solve all of the problems of conversational AI. However, in reality it will be only one valuable tool in our toolbox. We'll need other techniques to manage all aspects of an effective human-computer conversation.

Machine learning is particularly well suited to problems that involve finding patterns in large corpora of data. Or as Turing Award winner Judea Pearl pithily said, machine learning essentially resolves to curve fitting. There are several problems in conversational AI that map well to this type of solution, such as speech recognition and speech synthesis. The technique has also been applied to intent recognition (taking a textual sentence of human language and converting that into a high-level description of the user’s intent or desire) with good success, though there are some limitations in using this technique to capture meaning from natural language, which is inherently stateful, sensitive to context, and often ambiguous.
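To make the shape of intent recognition concrete, here is a minimal sketch of the input and output involved: a sentence goes in, and a structured description of the user's intent comes out. It is an illustration only. The `Intent` schema, the intent names, and the hand-written patterns are all hypothetical, and the pattern matching is a deliberately simple stand-in for the trained classifier a production system would use.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Intent:
    """High-level description of what the user wants (hypothetical schema)."""
    name: str
    slots: dict = field(default_factory=dict)


# Hypothetical hand-written patterns; a trained model would replace these.
PATTERNS = [
    ("get_time", re.compile(r"\bwhat(?:'s| is) the time\b|\bwhat time is it\b")),
    ("set_timer", re.compile(r"\bset (?:a )?timer for (?P<minutes>\d+) minutes?\b")),
]


def recognize_intent(text: str) -> Intent:
    """Map a sentence to an Intent, or to 'unknown' when nothing matches."""
    # Lowercase and normalize curly apostrophes so "What’s" matches "what's".
    normalized = text.lower().replace("\u2019", "'")
    for name, pattern in PATTERNS:
        match = pattern.search(normalized)
        if match:
            # Named groups in the pattern become the intent's slots.
            slots = {k: v for k, v in match.groupdict().items() if v is not None}
            return Intent(name, slots)
    return Intent("unknown")
```

Note that this sketch looks at one sentence in isolation. It has no notion of conversational state or context, which is exactly the limitation described above.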

However, there are certainly problems in computer conversation that are not as well suited to machine learning. Think of human-machine conversation as being composed of two parts:

Natural language understanding (NLU) -- understanding what the user said
Natural language generation (NLG) -- formulating a reasonable and on-topic response to the user

Much of the attention of late has been focused on that first part, but there are many challenges remaining on the generation side, and these tend not to be well suited to machine learning because response generation isn't simply a product of collecting and analyzing lots of data. The challenge of maintaining a believable, ongoing, and stateful conversation will require more focus on these NLG and dialog management parts of the problem over the coming years.

Higher fidelity experiences

Conversational experiences today can be quite simple and constrained. In order to move beyond these limitations we will need to support higher fidelity conversations. There are several parts to achieving this, including:

Wide and deep conversations. Most conversational experiences today are either very broad but shallow (e.g., “What’s the time?” => “The time is 9:45 a.m.”) or very narrow but deep (e.g., a multi-turn conversation in a quiz game). To advance beyond these limited experiences, we will need to get to a world of conversations that are both wide and deep. This will require a much better understanding of the context of a user’s input in order to respond appropriately, robust tracking of the state (memory) of a conversation, and the ability to scale beyond the current technical limitation of distinguishing among only a few hundred intents at a time.
Personalization. In a natural conversation between two people, each will normally draw on previous experiences with the other converser and will tailor their responses to that person. Computer conversations that don’t do this tend to feel unnatural and even annoying. Addressing this in the long term will require solving challenges such as speaker identification, so that the computer knows who you are and can respond differently to you versus someone else. Another aspect will be tracking state for previous conversations and being able to respond differently over time, such as learning the preferences or style of the specific user.
Multimodal input and output. Currently, conversational AI focuses on understanding spoken inputs and generating spoken responses. However, users could provide inputs in many different ways, and outputs could be generated in different forms too. For example, a user could press a button on a screen in addition to providing a spoken input. Or sentiment analysis could be used to provide an emotional-level input that the computer can react to. Supporting multiple inputs or outputs at the same time opens up a range of complexities that need to be considered. For example, if the user says “No” while pressing a “Yes” button, what should the system do?
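The closing question in the multimodal item above has no single right answer, but one possible policy can be sketched in a few lines. The example below assumes that each input has already been normalized to a simple answer such as "yes" or "no", and it chooses to ask the user for clarification whenever modalities disagree rather than guessing. The `ModalInput` and `Decision` types and the `resolve` function are hypothetical names invented for this illustration.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ModalInput:
    modality: str  # e.g. "speech" or "touch"
    value: str     # normalized answer, e.g. "yes" or "no"


@dataclass
class Decision:
    action: str                  # "accept", "clarify", or "wait"
    value: Optional[str] = None  # the agreed answer when action is "accept"


def resolve(inputs: List[ModalInput]) -> Decision:
    """Combine simultaneous inputs; ask the user when they conflict."""
    values = {item.value for item in inputs}
    if not values:
        return Decision("wait")  # nothing received yet
    if len(values) == 1:
        return Decision("accept", values.pop())  # all modalities agree
    return Decision("clarify")  # e.g. spoken "no" plus a pressed "Yes" button
```

Other policies are equally plausible, such as always trusting the most recent input or preferring touch over speech in noisy rooms. The point is that the system needs an explicit rule of some kind.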

Finding the right role for humans in the loop

As technologists, we are often driven to try to solve every problem computationally. However, some domains, such as gaming and entertainment or sales and marketing, may always want to finely craft the voice and personality of the computer’s responses to match their brand. It’s also been noted recently that fully automated natural language generation may not be the best way forward. The most natural human conversations are not the result of rehashing lots of previous conversations. Instead, they are formed by considering the current context, the unique conversational history between the two parties, and a set of broader conversational skills and conventions.

These arguments suggest that keeping a human in the loop of initial dialog generation may actually be a good thing, rather than something we must seek to eradicate. When I worked at Pixar on Finding Nemo, one of the big technical challenges was simulating the appearance and behavior of water. But even more difficult than solving the underlying physics simulation problem was that the water had to be human-directable: The film's director had to be able to request changes to how the water looked and reacted in a scene. That same qualifier will be true in the field of conversational AI: Natural language generation solutions must allow for input by a human "creative director" able to control the tone, style, and personality of the synthetic character.

Today, these creative inputs necessarily take the form of a human writing individual responses for each context that the system can recognize and defining how the conversation should flow on to the next question or topic. This is how practically all computer conversation experiences work at the moment. It seems unlikely we will completely remove this human in the loop over the next few years. So as we look toward the future, we will want to build more scalable and broad mechanisms for defining the voice and tone of a computer response, for example, by being able to define its key characteristics at a more abstract level.
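As a rough illustration of what defining characteristics "at a more abstract level" might look like, the sketch below separates what the character is doing (a dialog act such as greeting) from how a given persona phrases it. Everything here is hypothetical: the `Persona` fields, the `formality` trait, and the template table are invented for the example, and the lines are still written by a human author.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Persona:
    """Abstract description of a character's voice (hypothetical fields)."""
    name: str
    formality: str  # "casual" or "formal"


# Writer-authored lines, keyed by dialog act and persona trait.
TEMPLATES = {
    ("greet", "casual"): "Hey there! I'm {name}.",
    ("greet", "formal"): "Good day. My name is {name}.",
    ("farewell", "casual"): "Catch you later!",
    ("farewell", "formal"): "Goodbye, and thank you for your time.",
}


def render(persona: Persona, act: str) -> str:
    """Pick the line for this dialog act that matches the persona's tone."""
    template = TEMPLATES[(act, persona.formality)]
    return template.format(name=persona.name)
```

With this split, a creative director could change one trait on the persona and have every response shift in tone, instead of rewriting each response by hand.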

The HBO series Westworld does a great job of presenting this view of the world. The artificial “hosts” are obviously very complex and often indistinguishable from flesh and blood humans in terms of their responses and behaviors. However, this is achieved by having many writers in the “narrative” department defining the content for each host and their various high-level personality traits. Creative designers can tweak these factors using powerful visual authoring tools.

Over the coming years, the field could benefit from the development of flexible authoring tools to empower conversation writers in much the same way that tools like Photoshop empowered artists or Final Cut Pro empowered video creators.

A combination of richer tools for language generation and dialog management systems, higher fidelity experiences, and improved use of humans in the loop will produce better content and ultimately launch us forward into a world populated with delightful and seamless computer conversation experiences.

Martin Reddy is cofounder and CTO at voice technology company PullString.
