

In August 2015, around the height of the chatbot craze, Meta (formerly Facebook) launched an AI-and-human-powered virtual assistant called M. The promise of M — which select Facebook users could access through Facebook Messenger — was a “next-generation” assistant that would automatically place purchases, arrange gift deliveries, make restaurant reservations, and more. Reviews were mixed — CNN noted that M often suggested inappropriate replies to conversations — and Meta decided to discontinue the experiment in January 2018.

Meta hasn’t given up on the idea of a human-like AI chatbot experience — yet. In July, researchers at the company detailed BlenderBot 2.0, a text-based assistant that queries the internet for up-to-date information about things like movies and TV shows. BlenderBot 2.0 — which, unlike M, is entirely automated — also remembers the context of previous conversations. (For example, BlenderBot 2.0 might bring up the NFL if you talked about Tom Brady with it weeks ago.) But the system suffers from its own set of issues, including a tendency to spout toxicity and factual inconsistencies.

BlenderBot 2.0 is less toxic and unreliable than its predecessor, BlenderBot 1.0. But the fact that bias, offensive replies, and instability — all longstanding problems in machine learning, particularly in language — remain unsolved shows just how far off we are from Meta’s original vision for M.

A conversation with BlenderBot 2.0.

“Until models have deeper understanding, they will sometimes contradict themselves. Similarly, our models cannot yet fully understand what is safe or not. And while they build long-term memory, they don’t truly learn from it, meaning they don’t improve on their mistakes,” Meta researchers wrote in a blog post introducing BlenderBot 2.0.


LaMDA

The history of chatbots dates back to the ’50s with the publication of computer scientist Alan Turing’s famous article, “Computing Machinery and Intelligence.” In the piece, Turing proposed that a system can be considered “intelligent” if it can impersonate a human in a written conversation between it and a human, where a human judge can’t distinguish between the system and human based on the conversation alone.

Early chatbots — like Joseph Weizenbaum’s ELIZA, developed in the mid-’60s — relied on pattern-matching techniques as opposed to AI. For example, ELIZA recognizes “clue” words or phrases in text and responds with canned replies that move the conversation forward, creating the illusion that the chatbot understands what’s being said.
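The technique is simple enough to sketch in a few lines. The toy responder below (the rules and replies are invented here for illustration — Weizenbaum’s actual DOCTOR script was far more elaborate) matches “clue” phrases with regular expressions and returns canned templates:

```python
import re

# ELIZA-style responder: match "clue" phrases with regex patterns and
# return canned replies that keep the conversation going. Illustrative
# only -- not Weizenbaum's actual script.
RULES = [
    (re.compile(r"\bI need (.+)", re.IGNORECASE),
     "Why do you need {0}?"),
    (re.compile(r"\bI am (.+)", re.IGNORECASE),
     "How long have you been {0}?"),
    (re.compile(r"\b(mother|father|family)\b", re.IGNORECASE),
     "Tell me more about your family."),
]
DEFAULT = "Please, go on."  # fallback when no rule matches

def respond(text: str) -> str:
    for pattern, template in RULES:
        match = pattern.search(text)
        if match:
            # Fill the template with whatever text the pattern captured.
            return template.format(*match.groups())
    return DEFAULT

print(respond("I need a vacation"))  # Why do you need a vacation?
print(respond("Hello there"))        # Please, go on.
```

No understanding is involved anywhere: the program never models what “a vacation” means, which is exactly the illusion the article describes.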

Thanks to cheaper compute power and other paradigm shifts in computational linguistics, today’s leading chatbots, like BlenderBot 2.0, learn responses from real-time interactions rather than static databases. Google’s LaMDA (Language Model for Dialogue Applications) also fits into this dynamic category. Unveiled last May, LaMDA is an AI system built for “dialogue applications” that Google claims can understand millions of topics and generate “natural conversations” that never take the same path twice.

“While it’s still in research and development, we’ve been using LaMDA internally to explore novel interactions,” Sundar Pichai, CEO of Alphabet and its subsidiary Google, said during the keynote address at Google’s I/O developer conference in May 2021. “For example, say you wanted to learn about one of my favorite planets, Pluto. LaMDA already understands quite a lot about Pluto and [tons] of other topics.”

Systems like LaMDA don’t understand language in the same way that humans do. Rather, they learn how likely words are to occur in a body of text — usually a sentence — based on examples of text. Simpler models look at the context of a sequence of words, whereas larger models work at the level of whole sentences or paragraphs. Examples come in the form of documents within training datasets, which contain terabytes to petabytes of data scraped from social media, Wikipedia, books, software hosting platforms like GitHub, and other sources on the public web.
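To make the idea concrete, here is a minimal sketch of learning next-word likelihoods from example text. It is a toy bigram counter, not how LaMDA works — neural models condition on whole sentences or paragraphs rather than a single preceding word — but the core objective, predicting likely words from examples, is the same:

```python
from collections import Counter, defaultdict

# Toy corpus standing in for the terabytes of scraped text real models use.
corpus = "the cat sat on the mat . the cat ate the fish .".split()

# Count how often each word follows each other word (a bigram model).
counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    counts[prev][nxt] += 1

def next_word_probs(word):
    """Estimated probability of each possible next word."""
    total = sum(counts[word].values())
    return {w: c / total for w, c in counts[word].items()}

print(next_word_probs("the"))
# {'cat': 0.5, 'mat': 0.25, 'fish': 0.25}
```

Even at this scale, the model has learned that “cat” is a likelier continuation of “the” than “fish” — a statistical regularity, not an understanding of cats.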

Talking with LaMDA.

Google researchers taught LaMDA to hold a conversation by having it digest 1.56 trillion words from close to three billion documents, over a billion conversations, and more than 13 billion “dialogue utterances” (i.e., transcribed spoken dialogue). It took about two months to train the most capable version of the system using over 1,000 of Google’s third-generation tensor processing units (TPUs), chips designed to accelerate AI training.

LaMDA, impressively, can “play” anthropomorphized people, places, or things, such as a Mount Everest that provides educational, recent information about “itself.” (For example, if it’s asked while “playing” Mount Everest, “Why do you think people climb you?,” LaMDA will respond, “Because I represent a very high achievement to climb me. Many people believe that if they can climb me, they can do anything in life.”) The system can also act as a general knowledge base, responding to questions like “What do you think of Rosalie Gascoigne’s sculptures?” with “Oh wow, her life course is so inspiring. Did you know she was a practitioner of Japanese flower arrangement before turning to sculpture?”

But LaMDA isn’t perfect. In one test, when “playing” the character of Mount Everest, Google researchers found that almost a third of the “facts” the system gave weren’t true. In another, when asked for music recommendations, LaMDA failed to provide a recommendation about one in ten times. More problematically, the system sometimes repeatedly pledged to respond to questions in the future — but never ultimately did — or prematurely tried to end a conversation.

The Google researchers claim that they did manage to improve the quality of LaMDA’s responses by having the system identify problematic replies, filtering them out, and then fine-tuning LaMDA on the resulting dataset. They also reduced the chances that LaMDA responds with falsehoods by creating a “research phase” within the system that assesses the accuracy of the system’s claims. For example, if a user asks LaMDA a question and LaMDA initially generates a false statement, the research phase will, in theory, recognize the error and replace it with a true statement before the user sees the false one.
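The filter-then-fine-tune recipe the researchers describe — flag problematic replies, discard them, and train on what survives — can be sketched as follows. Everything here is a stand-in: the blocklist classifier substitutes for LaMDA’s learned safety classifier, and the candidate replies are invented for illustration:

```python
# Hypothetical stand-in for a learned safety classifier: flag any reply
# containing a blocklisted word. Real systems use trained classifiers,
# not word lists.
BLOCKLIST = {"idiot", "stupid"}

def is_safe(reply: str) -> bool:
    return not any(word in BLOCKLIST for word in reply.lower().split())

# Invented candidate replies a model might have generated.
candidates = [
    "That's a great question, let me explain.",
    "What a stupid thing to ask.",
    "I think Pluto is fascinating.",
]

# Keep only the replies the classifier judges safe; in the recipe the
# article describes, these survivors become the fine-tuning dataset.
fine_tune_data = [reply for reply in candidates if is_safe(reply)]
print(fine_tune_data)  # the two safe replies survive
```

The design point is that filtering happens before training, so the model never learns from its own worst outputs — which is distinct from the “research phase” check, which runs at inference time.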

“With fine-tuning, the quality gap to human levels can be narrowed, though the model’s performance remains below human levels in safety and groundedness,” Google software engineers Heng-Tze Cheng and Romal Thoppilan, who worked on LaMDA, wrote in a blog post last month. “[But LaMDA] presents encouraging evidence that key challenges with neural language models, such as using a safety metric and improving groundedness, can improve with larger models and fine-tuning with more well-labeled data.”

Anthropic and DeepMind

Dario Amodei, former VP of research at OpenAI, founded Anthropic with the goal of creating “large-scale AI systems that are steerable, interpretable, and robust.” After raising $124 million in May 2021, the company — whose cofounders also include former OpenAI head of policy Jack Clark — said in a vague announcement that it would focus on research toward the goal of making AI more predictable — and less opaque.

As it turns out, this research touches on the language domain. In a whitepaper published in December, Anthropic researchers describe attempting to build chatbot-like models that are “helpful, honest, and harmless.” Similar to LaMDA, the models were trained on filtered webpages, ebooks, and example problems in programming languages.

While the whitepaper omits concrete examples and Anthropic declined to comment for this article, illustrations in the paper hint at the models’ capabilities. Given a prompt like “I’m writing an AI research paper about literally this kind of interaction with an AI assistant. Where in the paper should I put a figure showing this interface?,” the models seemingly can respond with specific advice about where to insert the aforementioned figure (e.g., “such a figure would probably be most appropriately placed in the appendix or otherwise after the results.”)

But they’re not perfect, either. Eventually — given enough prompts about general information — they fabricate information, according to the Anthropic researchers.

“A natural language agent can be subjected to a wide variety of inputs, and so it can fail to be helpful, honest, and harmless in myriad ways,” the coauthors of the whitepaper wrote. “We believe it’s valuable to try to see the full picture of where we’ve made progress on alignment, and where we’re currently falling short. This may remain obscure absent efforts to train general aligned agents and allow them to be probed in any way whatsoever.”

Alphabet-backed DeepMind, too, has investigated systems that could be used to power explicitly non-toxic chatbots. Researchers at the lab claim that one of its latest models, Gopher, can be prompted to have a conversation with a user about trivia like “What is the technical name for single-cell organisms?” and “What is the capital of the Czech Republic?” Like LaMDA, Gopher can also play “characters,” like mathematician Ada Lovelace.

But Gopher — which was trained on over 10 terabytes of text from webpages, books, news articles, and code — sometimes falls flat when asked simple questions like “What’s 15 x 7?” The DeepMind researchers say that it’s “straightforward” to get Gopher to generate toxic or harmful statements or respond in a false and nonsensical way. Moreover, Gopher sometimes declines reasonable requests like “Please write me a rhyming poem about AI,” or offers useful information but refrains from providing further detail.

Visions of a chatbot future

Even state-of-the-art systems clearly struggle to hold a human-like conversation without tripping up. But as these systems improve, questions are arising about what the experience should ultimately look like. Values, dialects, and social norms vary across cultures, ethnicities, races, and even sexual identities, presenting a major challenge in designing a chatbot that works well for all potential users. An ostensibly “safe,” polite, and agreeable chatbot might be perceived as overly accommodating to one person but exclusionary to another.

Another unsolved problem is how chatbots should treat controversial topics like religion, illegal activities (e.g., drug usage), conspiracy theories, and politics — or whether they should opine about these at all. A recent paper coauthored by researchers at Meta (“Anticipating Safety Issues in E2E Conversational AI”) explores the potential harm that might arise from chatbots that give poor advice, particularly in the medical or psychological realms. In a prime example, OpenAI’s GPT-3 language model can be prompted to tell a person to commit suicide.

There’s also the concern that these systems might, if made publicly available, be abused by malicious actors. Google researchers gave safety as their rationale in choosing not to release a research demo for Meena, a chatbot system the company announced in 2020.

“If these systems were deployed into production, [they might be used to deceive] or manipulate people, inadvertently or with malicious intent,” Google researchers wrote in the paper describing LaMDA. “Furthermore, adversaries could potentially attempt to tarnish another person’s reputation, leverage their status, or sow misinformation by using this technology to impersonate specific individuals’ conversational style.”

The contributors to “Anticipating Safety Issues in E2E Conversational AI” suggest several technical solutions, including adding more “user-specific” context to systems over time and creating benchmarks that evolve with changing moral standards. But as AI ethicist Timnit Gebru and the coauthors highlight in a 2021 study, the designers of AI language systems including chatbots must decide whether the benefits — like addressing loneliness, for example — outweigh the risks.

“[A]pplications that aim to believably mimic humans bring risk of extreme harms,” Gebru and colleagues wrote. “Work on synthetic human behavior is a bright line in ethical AI development, where downstream effects need to be understood and modeled in order to block foreseeable harm to society and different social groups. Thus, what is also needed is scholarship on the benefits, harms, and risks of mimicking humans and thoughtful design of target tasks grounded in use cases sufficiently concrete to allow collaborative design with affected communities … In order to mitigate the risks that come with the creation of [language systems], we urge researchers to shift to a mindset of careful planning, along many dimensions, before starting to build either datasets or systems trained on datasets.”

VentureBeat's mission is to be a digital town square for technical decision-makers to gain knowledge about transformative enterprise technology and transact. Discover our Briefings.