Today, the virtual assistant landscape is exploding with innovation: New applications and new forms of interaction are constantly emerging. Although the idea of a virtual assistant is decades old, it went mainstream with Apple’s introduction of Siri. Siri was created at SRI International based on years of AI research, spun off as an independent venture-backed company in 2007, and acquired by Apple in 2010.

The Siri that the world knows enables users to find information and execute important device functions in a fast and friendly way. But Siri was first developed as a “do engine,” similar to the emerging crop of AI assistants. In the original spin-off, Siri could understand the user’s intent, determine a set of web services to fulfill that intent, and present the user with a screen that made it easy to complete the task.

As people want to do more and more with their smartphones — including shopping, banking, and healthcare — the “do” engines of the future must be virtual specialists with deep knowledge in specific areas. Take a customer shopping for clothes in a retail store. The customer probably has particular needs and preferences, while the store clerk has knowledge of clothing in general, the store’s inventory, and fashion trends. Together, they solve the problem of finding clothes for the customer. The customer knows that the clerk represents the store but expects the clerk to learn about and care about the customer’s specific needs. The clerk interprets the customer’s requests in the context of what the store has to offer, the perceived preferences of the customer, and the customer’s response to the clerk’s comments and suggestions.

For a virtual specialist to approach this kind of capability, it must embody the store clerk’s specific knowledge, and it must also know how to have a conversation with the user.

The power of conversation

For task-oriented conversations — those in which people are trying to get something done — conversation isn’t simply a nice-to-have attribute. This kind of conversation is a co-discovery process in which the participants are collaborating to solve a problem. Conversation is a complex activity that we humans are so skilled at that we scarcely think about it.

During conversations, we assume knowledge on the part of the other person, and we introduce subtopics without warning and expect the other person to address them without losing track of the overall topic. We use pronouns and hypernyms to refer to things that we said previously (but not too long ago). We even expect the other person to (gently) get us back on track if we have been digressing.

Human conversation is not just verbal. We use gestures, gaze, and facial expressions to convey information and emotion. This is critical in in-person settings and is increasingly important online, as video interaction becomes commonplace. This trend will intensify as the Internet of Things finally comes to fruition. In a world of sensors and smart spaces, physical action, even walking into a room, becomes part of the interaction.

We combine verbal and non-verbal communication freely, sometimes unconsciously, through “body language.” We know we’re not fully communicating when we have only one mode, e.g., over the phone or playing charades. So the next generation of virtual specialists will need to fluidly combine verbal and non-verbal interaction to approach human conversational ability.

Adapting to user state

We expect to have a positive experience when we interact with people to get business done. Part of this is empathy, e.g., soothing us when we are frustrated, but it goes beyond that. If we are confused, we expect the person — or the computer system — to help us out: to explain things a different way, slow down, use pictures, and so on. On the other hand, if we are impatient with a long explanation, we want things to speed up.

To even approach human capability, the virtual specialist must recognize and adapt to the user’s state, whether that involves awareness (distracted?), emotion (angry?), or comprehension (confused?). While speech recognition is getting quite good and recognition of gestures and facial expressions is getting much more accurate, very few computer systems use verbal and visual information to recognize user state, and fewer still change their behavior in response to perceived user states. This kind of adaptation is critical to user experience and to long-term business success.
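As a rough illustration of the idea, state recognition and adaptation can be sketched as a fusion step: per-modality signals are combined into a state label, which then selects a presentation strategy. Everything here — the signal names, the threshold, and the response strategies — is an illustrative assumption, not how any shipping assistant works.

```python
from dataclasses import dataclass

# Hypothetical per-modality scores in [0, 1], e.g. one derived from
# voice prosody, one from facial expression, one from interaction tempo.
@dataclass
class UserSignals:
    frustration: float  # e.g. from voice prosody
    confusion: float    # e.g. from facial expression and gaze
    impatience: float   # e.g. from interruptions and speech tempo

def classify_state(s: UserSignals, threshold: float = 0.6) -> str:
    """Pick the dominant state if it clears the threshold, else 'neutral'."""
    scores = {"frustrated": s.frustration,
              "confused": s.confusion,
              "impatient": s.impatience}
    state, score = max(scores.items(), key=lambda kv: kv[1])
    return state if score >= threshold else "neutral"

def adapt_response(state: str) -> str:
    """Map a perceived user state to a presentation strategy."""
    return {
        "confused": "re-explain slowly, with pictures",
        "impatient": "shorten the explanation",
        "frustrated": "acknowledge the problem, offer help",
    }.get(state, "continue normally")

print(adapt_response(classify_state(UserSignals(0.2, 0.8, 0.1))))
# prints "re-explain slowly, with pictures"
```

The interesting design question hides in the thresholds and strategies: real systems would learn them per user rather than hard-code them.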

Computer systems, especially virtual specialists, have an advantage over store clerks: They work with the same user over extended periods. This makes it possible for the virtual specialist to develop accurate models of how the user state is changing, over minutes or over years. Besides adapting to the user’s state, it can anticipate changes in state, and perhaps head off negative states. For example, a virtual specialist may be able to pick up important cues about the user’s mental and physical health that can help with diagnosis and treatment.

Real personalization

Like their human counterparts, virtual specialists will be expected to learn users’ preferences and objectives and tailor their responses and actions accordingly. We are all used to getting personalized recommendations from services like Amazon or Netflix. But we recognize that these recommendations are really based on “people like you” analytics; we are being put into categories derived from the behavior of large numbers of users. Because virtual specialists have deep conversational interactions with us over long periods of time, they can learn a lot of specific information about us — and they don’t have to export that information to a big database to make it useful. This creates the opportunity for real personalization: personalization based on not-to-be-shared information, personal behavior patterns, expressed preferences, etc. For example, our store clerk virtual specialist may know that we have a big event coming up and suggest just the right new items from our favorite brands. And, as we saw above, this real personalization can extend into more serious parts of our life, like healthcare and wellness.
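A minimal sketch of what such no-export personalization might look like: a preference profile that lives entirely on the device and re-ranks a catalog from the user’s own feedback. The item fields and the brand-based ranking rule are illustrative assumptions, not any real system’s API.

```python
from collections import Counter

class LocalProfile:
    """A purely on-device preference store: nothing leaves the client.
    The 'brand' attribute is an illustrative stand-in for richer features."""
    def __init__(self):
        self.brand_likes = Counter()

    def record_feedback(self, item: dict, liked: bool):
        # Expressed preferences accumulate locally, per brand.
        self.brand_likes[item["brand"]] += 1 if liked else -1

    def rank(self, catalog: list[dict]) -> list[dict]:
        # Prefer items from brands this user has responded well to.
        return sorted(catalog,
                      key=lambda it: self.brand_likes[it["brand"]],
                      reverse=True)

profile = LocalProfile()
profile.record_feedback({"brand": "Acme"}, liked=True)
profile.record_feedback({"brand": "Zen"}, liked=False)
catalog = [{"name": "scarf", "brand": "Zen"},
           {"name": "jacket", "brand": "Acme"}]
print([it["name"] for it in profile.rank(catalog)])
# prints ['jacket', 'scarf']
```

The contrast with “people like you” analytics is in the data flow, not the math: the same counts could power a server-side recommender, but here they never leave the user’s device.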

Establishing trust

Finally, meaningful collaboration with computer systems will eventually come down to trust. We have well-developed norms for how and when to trust people in various situations, and mechanisms for testing and building trust. We trust the bank clerk to perform transactions for us, but we check our account balance. We seek the store clerk’s recommendation, but we know that it may be based on incentives that are different from our own.

Creating computer systems that we can trust requires advances in a number of computer science disciplines, ranging from system design to knowledge representation. Whatever is going on inside the systems, we will need time and experience to learn how far to trust them. The next generation of virtual specialists can help us along the way. One important step forward is being able to answer questions: Why did you recommend this? Why didn’t you recommend that? Another is for them to tell us when and why they think our behavior will lead to a bad result, and when and why their behavior may surprise us.
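One way to make “Why did you recommend this?” answerable is to record the reasons alongside the score at recommendation time, rather than reconstructing them afterward. A toy sketch, with illustrative preference fields and weights:

```python
def score_item(item: dict, prefs: dict):
    """Score an item and keep the reasons, so the assistant can later
    answer 'Why did you recommend this?'. Fields and weights are toy
    assumptions, not a real recommender's features."""
    reasons, score = [], 0.0
    if item["brand"] in prefs["favorite_brands"]:
        score += 1.0
        reasons.append(f"it is from {item['brand']}, a brand you like")
    if item["occasion"] == prefs.get("upcoming_event"):
        score += 1.0
        reasons.append(f"it suits your upcoming {item['occasion']}")
    return score, reasons

prefs = {"favorite_brands": {"Acme"}, "upcoming_event": "wedding"}
score, why = score_item({"brand": "Acme", "occasion": "wedding"}, prefs)
print(f"Recommended (score {score}) because " + " and ".join(why))
```

The same reason list also answers the negative question — “Why didn’t you recommend that?” reduces to showing which conditions an item failed to meet.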

Trust won’t — and shouldn’t — be established overnight, but conversational virtual specialists can provide mechanisms for learning about the systems and exploring the boundaries of trust.