AI models are becoming better at answering questions, but they're not perfect


Late last year, the Allen Institute for AI, the research institute founded by the late Microsoft cofounder Paul Allen, quietly open-sourced a large AI language model called Macaw. Unlike other language models that have captured the public's attention recently (see OpenAI's GPT-3), Macaw is fairly limited in what it can do: it only answers and generates questions. But the researchers behind Macaw claim that it can outperform GPT-3 on a set of questions, despite being an order of magnitude smaller.

Answering questions might not be the most exciting application of AI. But question-answering technologies are becoming increasingly valuable in the enterprise. Rising customer call and email volumes during the pandemic spurred businesses to turn to automated chat assistants -- according to Statista, the size of the chatbot market will surpass $1.25 billion by 2025. But chatbots and other conversational AI technologies remain fairly rigid, bound by the questions that they were trained on.

Today, the Allen Institute released an interactive demo for exploring Macaw as a complement to the GitHub repository containing Macaw's code. The lab believes that the model's performance and "practical" size -- about 16 times smaller than GPT-3 -- illustrate how large language models are becoming "commoditized" into something much more broadly accessible and deployable.

Answering questions

Built on UnifiedQA, the Allen Institute's previous attempt at a generalizable question-answering system, Macaw was fine-tuned on datasets containing thousands of yes/no questions, stories designed to test reading comprehension, explanations for questions, and school science and English exam questions. The largest version of the model -- the one that powers the demo and has been open-sourced -- contains 11 billion parameters, significantly fewer than GPT-3's 175 billion parameters.

Given a question, Macaw can produce an answer and an explanation. If given an answer, the model can generate a question (optionally a multiple-choice question) and an explanation. Finally, if given an explanation, Macaw can give a question and an answer.
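As a rough illustration of these input/output combinations, the Python sketch below assembles a "slot"-style prompt in which the requested outputs are listed first and the supplied inputs follow. The slot syntax, the slot names, and the `build_prompt` helper are assumptions made for illustration; the exact format Macaw expects should be checked against its GitHub repository.

```python
from typing import Dict, List

# Slot names assumed for illustration; not confirmed by the article.
SLOTS = ("question", "answer", "explanation", "mcoptions")


def build_prompt(outputs: List[str], inputs: Dict[str, str]) -> str:
    """Build a slot-style prompt: requested output slots, then given input slots."""
    for name in list(outputs) + list(inputs):
        if name not in SLOTS:
            raise ValueError(f"unknown slot: {name}")
    overlap = set(outputs) & set(inputs)
    if overlap:
        raise ValueError(f"slot both requested and supplied: {sorted(overlap)}")
    # Requested slots appear bare; supplied slots carry their text.
    parts = [f"${name}$" for name in outputs]
    parts += [f"${name}$ = {text}" for name, text in inputs.items()]
    return " ; ".join(parts)


# Question in, answer and explanation out.
print(build_prompt(
    ["answer", "explanation"],
    {"question": "If a bird didn't have wings, how would it be affected?"},
))

# Answer in, question and explanation out.
print(build_prompt(
    ["question", "explanation"],
    {"answer": "It would be unable to fly."},
))
```

The same helper covers the third case (explanation in, question and answer out) by swapping which slots are requested and which are supplied.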

"Macaw was built by training Google’s T5 transformer model on roughly 300,000 questions and answers, gathered from several existing datasets that the natural-language community has created over the years," the Allen Institute's Peter Clark and Oyvind Tafjord, who were involved in Macaw's development, told VentureBeat via email. "The Macaw models were trained on a Google cloud TPU (v3-8). The training leverages the pretraining already done by Google in their T5 model, thus avoiding a significant expense (both cost and environmental) in building Macaw. From T5, the additional fine-tuning we did for the largest model took 30 hours of TPU time."

In machine learning, parameters are the part of the model that's learned from historical training data. Generally speaking, in the language domain, the correlation between the number of parameters and sophistication has held up remarkably well. But Macaw punches above its weight. When tested on 300 questions created by Allen Institute researchers specifically to "break" Macaw, Macaw outperformed not only GPT-3 but also the recent Jurassic-1 Jumbo model from AI21 Labs, which is even larger than GPT-3.

According to the researchers, Macaw shows some ability to reason about novel hypothetical situations, allowing it to answer questions like "How would you make a house conduct electricity?" with "Paint it with a metal paint." The model also hints at awareness of the role of objects in different situations and appears to know what an implication is, for example answering the question "If a bird didn't have wings, how would it be affected?" with "It would be unable to fly."

But the model has limitations. In general, Macaw is fooled by questions with false presuppositions like "How old was Mark Zuckerberg when he founded Google?" It occasionally makes errors answering questions that require commonsense reasoning, such as "What happens if I drop a glass on a bed of feathers?" (Macaw answers "The glass shatters"). Moreover, the model generates overly brief answers, breaks down when questions are rephrased, and repeats answers to certain questions.

The researchers also note that Macaw, like other large language models, isn't free from bias and toxicity, which it might pick up from the datasets that were used to train it. Clark added: "Macaw is being released without any usage restrictions. Being an open-ended generation model means that there are no guarantees about the output (in terms of bias, inappropriate language, etc.), so we expect its initial use to be for research purposes (e.g., to study what current models are capable of)."

Implications

Macaw might not solve the current outstanding challenges in language model design, among them bias. Plus, the model still requires fairly powerful hardware to get up and running -- the researchers recommend 48GB of total GPU memory. (Two of Nvidia's 3090 GPUs, which have 24GB of memory each, cost $3,000 or more -- not accounting for the other components needed to use them.) But Macaw does demonstrate that, to the Allen Institute's point, capable language models are becoming more accessible than they used to be. GPT-3 isn't open source, but if it were, one estimate pegs the cost of running it on a single Amazon Web Services instance at a minimum of $87,000 per year.
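A back-of-the-envelope calculation helps put these figures in context. The sketch below uses the parameter counts cited in this article and assumes, purely for illustration, that each parameter is stored as a 32-bit (4-byte) float; the researchers' 48GB recommendation may rest on other considerations, such as memory for activations.

```python
GPT3_PARAMS = 175e9   # parameter count cited in the article
MACAW_PARAMS = 11e9   # parameter count cited in the article


def size_ratio(larger: float, smaller: float) -> float:
    """How many times larger one model is than another, by parameter count."""
    return larger / smaller


def weight_memory_gb(n_params: float, bytes_per_param: int = 4) -> float:
    """Memory for the weights alone, in decimal GB (assumes 4-byte floats by default)."""
    return n_params * bytes_per_param / 1e9


print(round(size_ratio(GPT3_PARAMS, MACAW_PARAMS), 1))  # 15.9, i.e. "about 16 times"
print(weight_memory_gb(MACAW_PARAMS))                   # 44.0 GB at 32-bit precision
print(weight_memory_gb(GPT3_PARAMS))                    # 700.0 GB at 32-bit precision
```

Under that assumption, Macaw's weights alone would take roughly 44GB, which fits within two 24GB cards, while a GPT-3-sized model would need many times that.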

Macaw joins other open source, multi-task models that have been released over the past several years, including EleutherAI's GPT-Neo and BigScience's T0. DeepMind recently showed a model with 7 billion parameters, RETRO, that it claims can beat others 25 times its size by leveraging a large database of text. Already, these models have found new applications and spawned startups. Macaw -- and other question-answering systems like it -- could be poised to do the same.