How to build smarter chatbots

We're going to be blunt: Chatbots in their current form aren't great.

We were promised bots that would change the way we interact with businesses and services, but instead we have interactive bots that perform worse than apps. 33,000 chatbots have been launched on Facebook Messenger in the last six months, yet most of them have very few conversational capabilities. They are primarily focused on taps or interactive graphical interfaces, and conversing with them using natural language is nearly impossible.

Take the example of Poncho Weather on Facebook Messenger. Let's say I'm going to a conference next Monday in San Diego and want to know what the forecast is. The bot struggles to understand even on-topic questions like this one.

Language is hard. Developers are working with the technology that's available to them. So how can we turn dumb chatbots into intelligent conversational agents? We think you already know the answer: artificial intelligence.

State of the bot today

Maluuba is an artificial intelligence company focused on understanding natural language. Over the past year, we have been working on technology that's going to power conversational agents.

We want to build conversational agents that are able to carry on a dialogue just like you and I can. People don't talk to each other in scripted ways, and language is filled with nuance. Different ways of saying the same thing, slang and colloquialisms, and jumps from topic to topic within a conversation are big challenges for developers who want to deliver truly conversational experiences.

Example: Customer wants to buy a TV

Let's take a look at an example where a consumer wants to buy a new television. We will demonstrate how Maluuba's technology is solving problems that arise in this scenario. The consumer probably has something in mind, such as the screen size and how much she wants to spend, but she may not have considered every detail. She starts chatting with a retailer chatbot agent.

USER: I'd like to get a new TV, ideally around 50 inches.

A well-scripted bot may pick up on the word "TV" and that the size should be equal to or close to 50 inches.

BOT: Great, I can help with that. Are you looking for a 4K or 1080p?

The bot has asked a valid question. Maybe there are buttons the user can tap, or perhaps she can type her answer. But what if the user sends this message back:

USER: I don't know? Also, what's the difference between LED and OLED?

A scripted bot with canned responses might then reiterate its message, asking her to choose from 4K or 1080p. Or perhaps it can answer questions about LED vs. OLED. What if the user then says:

USER: Hmmm, that sounds good. Yep let's do that.

Is the user talking about LED vs. OLED or 4K vs. 1080p? There is a lot of information that the bot must keep "in state" as it tries to answer the user's questions. Given that there are a lot of specifications to decide on with a TV and that the user may jump from topic to topic, this can be a big challenge. Does she have to go back to the beginning if she changes her mind? What if she decides she wants to order a computer monitor instead?
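To picture what keeping information "in state" involves, here is a minimal sketch. It is not Maluuba's implementation: the `DialogueState` class, its field names, the specific offers, and the rule that a bare acceptance refers to the most recent open offer are all assumptions made for illustration.

```python
from dataclasses import dataclass, field


@dataclass
class DialogueState:
    """Hypothetical dialogue state for the TV example (all names are illustrative)."""
    slots: dict = field(default_factory=dict)        # decisions made so far
    open_offers: list = field(default_factory=list)  # (slot, value) pairs the bot has proposed

    def offer(self, slot, value):
        # Remember what the bot just proposed so a later "let's do that" can be resolved.
        self.open_offers.append((slot, value))

    def accept_latest(self):
        # Assumed rule: a bare acceptance refers to the most recent open offer.
        if not self.open_offers:
            return None  # nothing to resolve; the bot should ask for clarification
        slot, value = self.open_offers.pop()
        self.slots[slot] = value
        return slot, value


state = DialogueState()
state.slots["size_inches"] = 50   # "ideally around 50 inches"
state.offer("resolution", "4K")   # assumed: the bot suggested 4K earlier
state.offer("panel", "OLED")      # assumed: the bot then suggested OLED
resolved = state.accept_latest()  # "Yep let's do that." -> ("panel", "OLED")
```

The "most recent offer" rule is exactly the kind of handcrafted guess that breaks when the user hops between topics. The resolution question is still sitting unresolved in `open_offers`, and nothing here handles a change of mind.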

Assume that the user eventually gets to the point of deciding on a specific TV. The bot shows three options, perhaps sending photos or short video clips of each, along with pricing. Let's say there are sets for $2,000, $1,500, and $1,400.

USER: That's too expensive. I want something that's less than $1,000. Oh, and do you deliver?

As you can see, this conversation sounds completely normal. A human agent could easily understand and handle the shifting conversation. A scripted bot? Not so good.

Imagine if a conversational agent with human-like comprehension existed. You would actually open up iMessage or Messenger and converse with the services and businesses you use. We are not saying intelligent conversational agents like this exist today, but 1) our technology can make current bots hella smarter and 2) we know we're on the right path and getting closer every day.

Four stages of bot understanding

Before we talk about the solution we have come up with, let's discuss some problems that exist with current chatbots:

Chatbots require a lot of manual programming.
They fail to understand users' requests.
They typically have no memory.
They don't understand follow-up questions or contextual queries.
They can't make decisions on their own and fail miserably when they go off script.
They use templated responses without any variance that get repetitive over time.

Imagine you want to build a conversational agent that addresses the problems listed above. Which components do you think you would need? Well, first, the agent has to decode whatever the user is saying. The agent then needs to learn how to follow a process of comprehension and action in the conversation. This breaks down into four stages:

Natural language understanding: Identify intents in human speech.
State tracking: Understand what is being discussed at this point in time and in the context of the full conversation.
Dialogue management: Reason over the course of the dialogue to decide what to say next.
Natural language generation: Respond to the human in natural language.
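The four stages above can be sketched as a chain of functions. This is a toy outline under stated assumptions, not a description of Maluuba's system: keyword rules stand in for learned understanding, a fixed rule stands in for the learned dialogue manager, and every function name, slot name, and template is hypothetical.

```python
import re

REQUIRED_SLOTS = ["size_inches", "resolution"]  # hypothetical slots for the TV example


def understand(utterance):
    """NLU stand-in: keyword rules instead of a learned model."""
    text = utterance.lower()
    frame = {"intent": "unknown", "slots": {}}
    if "tv" in text:
        frame["intent"] = "buy_tv"
    size = re.search(r"(\d+)\s*inch", text)
    if size:
        frame["slots"]["size_inches"] = int(size.group(1))
    return frame


def track_state(state, frame):
    """State tracking: merge the new frame into what is known so far."""
    new_state = {
        "intent": state.get("intent", "unknown"),
        "slots": dict(state.get("slots", {})),
    }
    if frame["intent"] != "unknown":
        new_state["intent"] = frame["intent"]
    new_state["slots"].update(frame["slots"])
    return new_state


def decide(state):
    """Dialogue management: choose the next dialogue act (a fixed rule in this sketch)."""
    if state["intent"] != "buy_tv":
        return ("ask_intent", None)
    for slot in REQUIRED_SLOTS:
        if slot not in state["slots"]:
            return ("request", slot)
    return ("recommend", None)


def generate(act):
    """NLG: turn a dialogue act into text (plain templates in this sketch)."""
    templates = {
        ("ask_intent", None): "What can I help you find?",
        ("request", "size_inches"): "What screen size are you looking for?",
        ("request", "resolution"): "Are you looking for 4K or 1080p?",
        ("recommend", None): "Here are some TVs that match.",
    }
    return templates[act]


def respond(state, utterance):
    """Run one user turn through all four stages."""
    state = track_state(state, understand(utterance))
    return state, generate(decide(state))


state, reply = respond({}, "I'd like to get a new TV, ideally around 50 inches.")
# reply is "Are you looking for 4K or 1080p?"
```

Every stage here is handcrafted, which is the problem described earlier: the rules in `decide` and the templates in `generate` have to be written and maintained by hand for each new scenario.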

Machine learning has been used to tackle these challenges. However, because of the inherent complexity and structure of information exchange, a significant amount of handcrafting was often necessary to train models. Recent advances in deep learning have helped alleviate this.

How our bot learns

We will focus here on dialogue management, the decision-making component that often cannot be handcrafted.

Training the dialogue manager

The dialogue manager is the brain of the system. It decides what to say next to the user based on everything that has happened during the dialogue. The dialogue manager's training can be done in two steps. The idea is to emulate how a human being learns to perform a new task.

First, we learn about the task and about successful strategies by observing others perform the task. Then, we try performing the task and we adjust our behavior based on our own mistakes. Similarly, the dialogue manager first observes successful dialogues.

Using our retailer bot agent example, transcripts of conversations between human agents and customers can be used for this part of the training. This gives the dialogue manager a good start, but it also needs to learn to adapt to unseen behavior, first through dialogues with a simulator and then in conversation with real users.

This is similar to how people learn things. When a child is learning to hit a baseball, she might first watch her dad demonstrating or watch games on TV. After a while, her dad will help her swing the bat a few times. Eventually the child starts swinging on her own. With every swing, she gets better.

Today, most chatbots involve a lot of manual setup: defining a limited scope, flows, and rules. Sure, they understand the utterance, but the decision making and natural language generation (NLG) are completely handcrafted. Maluuba's technology is leapfrogging the competition by learning the decision-making process. Our algorithms can learn this from data before we deploy an agent to users. Once deployed, the agent can keep learning on the fly while interacting with real users via reinforcement learning. This is the approach DeepMind used to teach A.I. to play Atari games and, eventually, to teach AlphaGo to play Go so well that it beat a human world champion.

Two-stage deep learning

From a technical point of view, the first stage of training is done through deep supervised learning on the corpus of human agent-customer conversations. During this stage, the learning bot is trained to predict dialogue acts given a representation of dialogue history. A dialogue act is a representation of the intent of a speaker.

We train the system to emulate the human agent's intents based on dialogue history. We use a deep neural network to learn this mapping between dialogue history and dialogue acts.
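To make the supervised stage concrete, here is a deliberately small sketch of learning a mapping from dialogue history to dialogue acts. Several things are assumptions for illustration only: the act names, the three binary features that summarize the history, and the six toy "transcript" pairs are invented, and a plain softmax classifier in pure Python stands in for the deep neural network described above.

```python
import math

# Hypothetical dialogue acts for the retailer bot.
ACTS = ["request_size", "request_resolution", "answer_question", "recommend"]

# Invented training pairs: (dialogue-history features, act the human agent chose).
# Features: [size known, resolution known, user just asked a question]
TRANSCRIPTS = [
    ([0, 0, 0], "request_size"),
    ([1, 0, 0], "request_resolution"),
    ([1, 0, 1], "answer_question"),
    ([0, 0, 1], "answer_question"),
    ([1, 1, 1], "answer_question"),
    ([1, 1, 0], "recommend"),
]


def softmax(scores):
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def predict_probs(weights, features):
    x = features + [1]  # append a constant bias input
    return softmax([sum(w * v for w, v in zip(row, x)) for row in weights])


def train(data, epochs=1000, lr=0.5):
    """Fit one weight row per act with stochastic gradient descent on cross-entropy."""
    n_inputs = len(data[0][0]) + 1
    weights = [[0.0] * n_inputs for _ in ACTS]
    for _epoch in range(epochs):
        for features, act in data:
            probs = predict_probs(weights, features)
            x = features + [1]
            for k, row in enumerate(weights):
                # Cross-entropy gradient: (predicted probability - target) * input
                error = probs[k] - (1.0 if ACTS[k] == act else 0.0)
                for j, v in enumerate(x):
                    row[j] -= lr * error * v
    return weights


def predict_act(weights, features):
    probs = predict_probs(weights, features)
    return ACTS[probs.index(max(probs))]


weights = train(TRANSCRIPTS)
next_act = predict_act(weights, [1, 0, 0])  # size known, resolution not yet known
```

A real system would learn the representation of dialogue history as well, rather than relying on three hand-picked features. The sketch only shows the shape of the task: history in, the human agent's dialogue act out.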

The second stage of training is done through deep reinforcement learning. In the reinforcement learning setting, the bot acts in an unknown environment and receives rewards for its actions. The bot's goal is to find the actions that maximize its expected sum of rewards at each state.

Here, the bot's actions are dialogue acts, the rewards are an estimate of user satisfaction, and the state is the representation of dialogue history. The unknown environment is user behavior: The bot has to try different intents based on dialogue history and then adjust its behavior based on the rewards, which reflect user satisfaction. This second step reuses the neural network of the first step and adjusts the mapping based on a second neural network that estimates the expected sums of rewards.
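The reinforcement learning setup just described can be sketched with a toy example. This is an illustration under loud assumptions, not the method from Maluuba's research: a lookup table trained with tabular Q-learning stands in for the neural networks, training starts from scratch rather than from the supervised network, and the simulated user, the act names, and the reward numbers (a stand-in for estimated user satisfaction) are all invented.

```python
import random

ACTS = ["request_size", "request_resolution", "recommend"]  # hypothetical dialogue acts


def simulated_user(state, act):
    """Hypothetical user simulator: returns (next_state, reward, done).

    state is (size_known, resolution_known). The rewards are made-up
    stand-ins for an estimate of user satisfaction.
    """
    size_known, resolution_known = state
    if act == "recommend":
        # A recommendation ends the dialogue; it only pleases a user whose needs are known.
        reward = 10.0 if size_known and resolution_known else -5.0
        return state, reward, True
    if act == "request_size":
        # Each turn costs a little; repeating a question costs more.
        return (True, resolution_known), (-2.0 if size_known else -1.0), False
    return (size_known, True), (-2.0 if resolution_known else -1.0), False


def train(episodes=2000, alpha=0.5, gamma=0.9, epsilon=0.2, max_turns=10, seed=0):
    rng = random.Random(seed)
    # q[state][act] estimates the expected discounted sum of rewards.
    q = {(s, r): {a: 0.0 for a in ACTS} for s in (False, True) for r in (False, True)}
    for _episode in range(episodes):
        state = (False, False)
        for _turn in range(max_turns):
            if rng.random() < epsilon:
                act = rng.choice(ACTS)             # explore a random act
            else:
                act = max(ACTS, key=q[state].get)  # exploit current estimates
            next_state, reward, done = simulated_user(state, act)
            future = 0.0 if done else max(q[next_state].values())
            # Q-learning update: move toward reward plus discounted best future value.
            q[state][act] += alpha * (reward + gamma * future - q[state][act])
            state = next_state
            if done:
                break
    return q


def best_act(q, state):
    return max(ACTS, key=q[state].get)


q = train()
final_act = best_act(q, (True, True))  # what to do once size and resolution are known
```

In this toy environment the table learns to ask for missing details before recommending, because premature recommendations and repeated questions are penalized. The discount factor `gamma` and the exploration rate `epsilon` are ordinary Q-learning knobs chosen arbitrarily here.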

Maluuba's research has demonstrated that using this two-stage approach with deep neural networks leads to faster convergence and less handcrafting than previous state-of-the-art learning algorithms.

Conversational commerce needs dialogue

Conversational chatbots have huge potential for sales, customer service, and other interactions. The ability of bots to scale, work without volume constraints, and continually learn will bring many benefits for businesses and consumers. That said, businesses are rightly concerned that the experience should not only resolve the customer's query but also do so in a manner that fits the customer's expectations.

Customers will expect to interact with chatbots in the same way they interact with people. They'll ask questions, make statements, interject, change their minds, and move from topic to topic. To handle this, Maluuba is focused on building intelligent virtual agents using deep supervised and reinforcement learning. Our research into state tracking, dialogue management, and natural language generation will further train and improve the ability of bots to comprehend natural language and respond accordingly to satisfy users' requirements.