AI isn't yet ready to pass for human on video calls

Leading up to Super Bowl Sunday, Amazon flooded social media with coquettish ads teasing “Alexa’s new body.” Its game-day commercial depicts one woman’s fantasy of the AI voice assistant embodied by actor Michael B. Jordan, who seductively caters to her every whim, to the consternation of her increasingly irate husband. No doubt most viewers walked away giggling at the implausible idea of Amazon’s new line of spouse replacement robots, but the reality is that embodied, humanlike AI may be closer than you think.

Today, AI avatars (i.e., AI rendered with a digital body and/or face) lack the sex appeal of Michael B. Most, in fact, are downright creepy. Research shows that imbuing robots with humanlike features endears them to us, but only up to a point. Past that threshold, the more humanlike a system appears, the more repulsed we paradoxically feel. That revulsion has a name: “The Uncanny Valley.” Masahiro Mori, the roboticist who coined the term, predicted a peak beyond the Uncanny Valley wherein robots become indistinguishable from humans, beguiling us once more. You can imagine such a robot fooling us into believing it’s human on a video call: a twenty-first-century refactoring of the old text-based Turing Test.

On a recent Zoom with legendary marketer Guy Kawasaki, I made a bold proclamation: In two years' time, Guy would be unable to distinguish between me and my company's conversational AI, Kuki, on a video call. Guy’s eyebrows arched at the claim, and caveats began to cascade from my big fat mouth. Maybe on a short video call. With low bandwidth. If he was drinking champagne and dialing in from a bubble bath, like the lady in the Alexa ad.

So let this be my public mea culpa, and a more grounded prediction. An AI good enough to pass as human on a video call needs five key technologies running in real time:

A humanlike avatar

Avatars have come a long way recently, thanks to the wide, cheap availability of motion capture technology (“MoCap”) and generative adversarial networks (“GANs”), the machine learning technique underlying deepfakes. MoCap, which allows actors to puppet characters via haptic suits and originally required the big-budget backing of films like Avatar, is now accessible to anyone with an iPhone X and free game engine software. Numerous web services make it trivial to create low-res deepfake images and video, democratizing technology that, if left unchecked, could be a death knell for democracy. Such advances have spawned new industries, from Japanese VTubers (a rising trend in the US recently co-opted by PewDiePie) to fake “AI” influencers like Lil Miquela that purport to virtualize talents but secretly rely on human models behind the scenes. With last week’s announcement of the “MetaHuman” creator from Epic Games (purveyors of Fortnite and the Unreal Engine, in an industry whose revenue in 2020 surpassed that of movies and sports combined), soon anyone will be able to create and puppet infinite photorealistic fake faces, for free.

A humanlike voice

Technology enabling humanlike voices is also rapidly advancing. Amazon, Microsoft, and Google offer consumable cloud text-to-speech (TTS) APIs that, underpinned by neural networks, generate increasingly humanlike speech. Tools for creating custom voice fonts, modeled after a human actor using recorded sample sentences, are also readily available. Speech synthesis, like its now highly accurate counterpart speech recognition, will only continue to improve with more compute power and training data.

Humanlike emotions

But a convincing AI voice and face are worthless without expressions to match. Computer vision via the front-facing camera has proved promising at deciphering human facial expressions, and off-the-shelf APIs can analyze the sentiment of text. Labs like NTT Data’s have showcased avatars that mimic human gestures and expressions in real time, and Magic Leap’s MICA teased compelling nonverbal avatar expressions. Yet mirroring a human is one thing; building an AI with its own apparent autonomous mental and emotional state is another challenge altogether.

Humanlike movements

To avoid what Dr. Ari Shapiro calls The Uncanny Valley of Behavior, AI must display humanlike movements to match its “state of mind,” triggered procedurally and dynamically based on how the conversation is unfolding. Shapiro’s work at USC’s ICT lab has been seminal in this field, as has that of startups like Speech Graphics, whose technology powers lip sync and facial expressions for gaming characters. Such systems take an avatar’s textual utterance, analyze the sentiment, and assign an appropriate animation from a library using rules, sometimes coupled with machine learning trained on videos of real humans moving. With more R&D and ML, procedural animation may well be seamless in two years’ time.
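To make the utterance-to-animation pipeline concrete, here is a minimal sketch in Python of the rule-based step described above. It is an illustration only, not how Speech Graphics or the ICT lab actually implement it: the word lists, the clip names, and the functions `sentiment_score` and `pick_animation` are all hypothetical, and a production system would swap the toy lexicon for a trained sentiment model.

```python
import re

# Hypothetical mini-lexicon standing in for a real sentiment model (assumption).
POSITIVE = {"love", "great", "happy", "wonderful", "thanks", "glad"}
NEGATIVE = {"hate", "terrible", "sad", "sorry", "awful", "angry"}

# Hypothetical animation library: one clip name per coarse emotional state.
ANIMATIONS = {
    "positive": "smile_and_nod",
    "negative": "furrowed_brow",
    "neutral": "idle_blink",
}


def sentiment_score(utterance: str) -> int:
    """Count positive words minus negative words in the utterance."""
    words = re.findall(r"[a-z']+", utterance.lower())
    positives = sum(word in POSITIVE for word in words)
    negatives = sum(word in NEGATIVE for word in words)
    return positives - negatives


def pick_animation(utterance: str) -> str:
    """Apply simple rules: the sign of the score selects the animation clip."""
    score = sentiment_score(utterance)
    if score > 0:
        return ANIMATIONS["positive"]
    if score < 0:
        return ANIMATIONS["negative"]
    return ANIMATIONS["neutral"]


if __name__ == "__main__":
    for line in ["I love chatting with you, thanks!",
                 "I'm sorry, that sounds terrible.",
                 "What time is it?"]:
        print(f"{line!r} -> {pick_animation(line)}")
```

Even this toy version shows where the hard part lies: the rules are trivial to write, but choosing clips that look natural, and blending them as the conversation shifts, is where the machine learning trained on real human movement comes in.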

Humanlike conversation

Humanlike conversation is the final, and hardest, piece of the puzzle. While chatbots can deliver business value within confined domains, most still struggle to carry on a basic conversation. The recipe of deep learning plus more data plus more compute power has so far failed to yield breakthroughs in natural language understanding comparable to those in other AI fields like speech synthesis and computer vision.

The idea of humanlike AI is deeply sexy (to the tune of more than $320 million in venture funding and counting), but for at least the next few years, until the key components are “solved,” it’s likely to remain a fantasy. And as avatar improvements outpace other advances, our expectations will rise, but so will our disappointment when virtual assistants’ pretty faces lack the EQ and brains to match. So it’s probably too early to speculate about when a robot may fool a human on a video call, especially given that machines have yet to truly pass the traditional text-based Turing Test.

Maybe a more important question than when (or whether) we can create humanlike AI is: should we? Do the opportunities (for interactive media characters, for AI healthcare companions, for training or education) outweigh the dangers? And does humanlike AI necessarily mean “capable of passing as human,” or should we strive, as many industry insiders advocate, for distinctly non-human stylized beings to sidestep the Uncanny Valley? Personally, as a lifelong sci-fi geek, I’ve always yearned for a super AI sidekick that’s humanlike enough to banter with me, and I hope that with the right regulation, beginning with baseline laws requiring all AIs to self-identify as such, this technology will result in a net positive for humanity. Or, at the very least, a coin-operated celebrity doppelganger like Michael B. to read you romance novels until your Audible free trial expires.

Lauren Kunze is CEO of Pandorabots, maker of conversational AI Kuki.
