Digital assistants are increasingly making their way into our everyday lives, not only via mobile devices, but also through home devices, cars, and more.

As their scope widens, the technology powering them must evolve and deepen to meet people’s ever-growing expectations and pace of life. With the intelligence of these assistants maturing, businesses will see new uses that go beyond today’s simple question-and-answer interactions (Question: What is today’s weather? Answer: Sunny and 64 degrees Fahrenheit). This new generation of assistants will have implications beyond our original thinking, but it all starts with the tech.

Voice artificial intelligence (Voice AI) involves the application of artificial intelligence techniques to voice-based interactions, enabling users to converse with systems in a flexible and collaborative way. Imagine the following dialogue with a collaborative assistant:

User: <hears a song on the device> “When is their next concert?”
System: “Coldplay’s next concert is in Seattle on March 25. Would you like me to get you tickets?”
User: “Yes”
System: “That concert is sold out, but there is a concert in Portland on March 24. Would you like tickets to that one?”
User: “Sure”
System: “Tickets are $35 to $100. Here is the seating chart”


The assistant has interpreted the user’s utterance in context (the song that is playing in the background), provided an answer, and gone beyond what was asked to infer what the user really wants — to attend the concert. It then offers to purchase tickets in order to satisfy the user’s higher-level goal of hearing the concert. However, the system finds that the initial plan will fail, so it searches for another way to achieve the goal of hearing the concert. Systems that users will want to use daily need to be capable of these kinds of collaborative responses.

We begin with speech recognition, the science of determining the words people have spoken. Speech recognition has made great progress in recent years because of developments in deep learning with neural networks and a major hardware speed increase made possible by graphics processing units (GPUs). The upshot is that for human-computer interaction, error rates are less than 10 percent, and users can largely depend on their speech to computers being recognized correctly.

The next stage is natural language understanding: determining what the utterance means in context. Voice AI-based “semantic parsers” build representations of utterance meaning and do so compositionally. That is, the meaning of the whole depends on the meanings of the parts (such as verbs, nouns, and adjectives). Such systems are contextual in that the meanings they derive build on the prior linguistic and current situational context. This means they can, for example, infer what you are referring to with pronouns and noun phrases (such as “the nearest Japanese restaurant to the Space Needle”). They also process inputs from multiple modalities, including voice, vision, and gesture, and then fuse the meaningful information derived from each mode into a joint meaning. One important application of multimodal processing is to recognize users’ emotional states and respond appropriately in the situational context.
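To make compositional, context-dependent meaning concrete, here is a minimal sketch in Python. It is a toy illustration, not how any production semantic parser works: the `Context` class, the tiny `LEXICON`, and the field names in the resulting meaning are all assumptions invented for this example. Each word contributes one piece of the meaning, and the pronoun “their” is resolved from the situational context (the song that is playing).

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional

# Hypothetical set of pronouns this toy parser knows how to resolve.
PRONOUNS = {"their", "they", "them", "it", "its"}

# Hypothetical lexicon: each word contributes one piece of the overall meaning.
LEXICON: Dict[str, Dict[str, str]] = {
    "when": {"intent": "query_date"},
    "next": {"which": "next"},
    "concert": {"event_type": "concert"},
}


@dataclass
class Context:
    """Linguistic and situational context available to the parser (illustrative)."""
    now_playing_artist: Optional[str] = None
    mentioned_entities: List[str] = field(default_factory=list)


def resolve_pronoun(ctx: Context) -> Optional[str]:
    """Prefer the most recently mentioned entity, else fall back to what is playing."""
    if ctx.mentioned_entities:
        return ctx.mentioned_entities[-1]
    return ctx.now_playing_artist


def parse(utterance: str, ctx: Context) -> Dict[str, Optional[str]]:
    """Build the meaning of the whole utterance from the meanings of its parts."""
    meaning: Dict[str, Optional[str]] = {}
    for raw in utterance.lower().split():
        word = raw.strip("?.,!")
        if word in PRONOUNS:
            meaning["artist"] = resolve_pronoun(ctx)
        elif word in LEXICON:
            meaning.update(LEXICON[word])
    return meaning


ctx = Context(now_playing_artist="Coldplay")
meaning = parse("When is their next concert?", ctx)
print(meaning)
```

Without the context, the same utterance yields a meaning whose `artist` is unresolved, which is why the system needs to know what is playing before it can answer.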

Once a semantic parser determines the meaning of an utterance, a dialogue management subsystem needs to decide what to do, such as whether to answer questions, perform actions, ask for more information, and so on. To be a collaborative assistant, the system’s response needs to further the user’s intentions and plans, not merely respond to what they literally said. This is illustrated in the concert example above. In order to respond collaboratively, the system infers the user’s underlying goal and determines whether the plan to achieve it will be successful. If the plan is not successful, the system will find another way to achieve the user’s goal.
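The fallback behavior described here can be sketched in a few lines of Python. This is a simplified illustration under stated assumptions: the `Concert` class, the `respond` function, and the hard-coded concert list are hypothetical, and a real dialogue manager would rely on plan recognition rather than a fixed rule. The sketch only shows the core idea: when the first plan (tickets to the Seattle concert) fails, the system looks for another way to reach the same goal.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Concert:
    """A candidate way to satisfy the goal of attending a concert (illustrative)."""
    city: str
    date: str
    sold_out: bool


def first_available(concerts: List[Concert]) -> Optional[Concert]:
    """Return the first concert that can still satisfy the goal, if any."""
    for concert in concerts:
        if not concert.sold_out:
            return concert
    return None


def respond(concerts: List[Concert]) -> str:
    """Offer the initial plan if it works; otherwise propose an alternative."""
    if not concerts:
        return "I could not find any upcoming concerts."
    first = concerts[0]
    if not first.sold_out:
        return f"Would you like tickets to the {first.city} concert on {first.date}?"
    # The initial plan fails, so search for another way to achieve the goal.
    alternative = first_available(concerts[1:])
    if alternative is None:
        return "That concert is sold out, and I could not find another one."
    return (
        "That concert is sold out, but there is a concert in "
        f"{alternative.city} on {alternative.date}. "
        "Would you like tickets to that one?"
    )


concerts = [
    Concert(city="Seattle", date="March 25", sold_out=True),
    Concert(city="Portland", date="March 24", sold_out=False),
]
reply = respond(concerts)
print(reply)
```

With the two concerts from the dialogue above, the sketch produces the same Portland suggestion the assistant gives in the example.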

Machine learning, especially “deep learning,” is a primary tool for building such advanced systems. For example, using deep learning, semantic parsers can be built to associate utterances with “meaning representations” of the user’s literal intent. In addition to deep learning for language processing, planning and plan recognition technologies will be needed to build collaborative assistants. All these components will learn from large datasets, including collections of utterances paired with “meaning representations,” large bodies of world knowledge encoded in knowledge graphs, and individual users’ patterns of actions and interactions.

The business implications for such deeply intuitive systems are vast. First, there will be significant business opportunities for technologies that enable systems to sense people, including cameras for vision-based person identification and tracking, as well as array microphones for reliably capturing speech. Building on these sensing technologies, intelligent collaborative dialogue systems will enable us to access information resources offered by businesses (such as healthcare information), form and execute complex plans involving multiple service providers (like planning a trip or instructing a robot), and interact with the Internet of Things, with its mobile devices, home-based devices, connected cars, and continually growing list of connected objects.

Voice-powered AI systems that infer people’s plans will provide an opportunity for businesses to judiciously offer products and services that are relevant to what people are trying to do. Because such voice commerce opportunities are precisely targeted at user intentions, they can be more successful than traditional advertising.
