Language models struggle to answer questions without paraphrasing training data

The task of long-form question answering (LFQA) involves retrieving documents relevant to a given question and using them to generate a paragraph-length answer to that question. While many machine learning models have recently been proposed for LFQA, the task remains challenging, as a recent paper coauthored by University of Massachusetts Amherst and Google researchers demonstrates.

The researchers developed an LFQA system that achieves state-of-the-art performance on a popular dataset. But they found that even the best LFQA models, including theirs, don't always answer in a way that's grounded in -- or demonstrates an understanding of -- the documents they retrieve.

Large language models like OpenAI's GPT-3 and Google's GShard learn to write humanlike text by internalizing billions of examples from the public web. Drawing on sources like ebooks, Wikipedia, and social media platforms like Reddit, they make inferences to complete sentences and even whole paragraphs. But studies demonstrate the pitfalls of this training approach. Open-domain question-answering models -- models theoretically capable of responding to novel questions with novel answers -- often simply memorize answers found in the data on which they're trained, to a degree that varies by dataset. Memorization also means that language models can be prompted to reveal sensitive, private information when fed certain words and phrases.

In this most recent study, the coauthors evaluated their LFQA model on ELI5, a dataset of questions and answers drawn from Reddit's "Explain Like I'm Five" forum. They found significant overlap between the data used to train the model and the data used to evaluate it: as many as 81% of the validation questions appeared in paraphrased form in the training set. The researchers say this reveals issues with the ELI5 dataset as well as with the model.
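
As a rough illustration of how this kind of train/evaluation overlap might be estimated, the Python sketch below flags held-out questions whose word overlap with any training question meets a threshold. This is not the method used in the paper; the function names, the Jaccard measure, the 0.6 threshold, and the toy questions are all assumptions chosen for illustration. Word overlap is a crude proxy that would miss paraphrases phrased with different vocabulary.

```python
import re


def tokens(text):
    """Lowercase a question and return its set of word tokens."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Jaccard similarity between two token sets (0.0 when both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def overlap_rate(train_questions, eval_questions, threshold=0.6):
    """Fraction of eval questions whose closest training question
    reaches the (assumed) similarity threshold."""
    train_sets = [tokens(q) for q in train_questions]
    hits = 0
    for question in eval_questions:
        q_set = tokens(question)
        best = max((jaccard(q_set, t) for t in train_sets), default=0.0)
        if best >= threshold:
            hits += 1
    return hits / len(eval_questions) if eval_questions else 0.0


# Hypothetical toy data, not taken from ELI5.
train = ["Why is the sky blue?", "How do airplanes stay in the air?"]
held_out = ["Why is the sky blue during the day?", "What causes inflation?"]
rate = overlap_rate(train, held_out)
print(f"{rate:.0%} of held-out questions closely match a training question")
```

A real audit would likely need a semantic notion of similarity, plus human review of borderline cases, since two questions can ask the same thing while sharing few words.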

"[Our] in-depth analysis reveals [shortcomings] not only with our model, but also with the ELI5 dataset and evaluation metrics. We hope that the community works towards solving these issues so that we can climb the right hills and make meaningful progress," they wrote in the paper.

Memorization isn't the only problem facing large language models. Recent research shows that even state-of-the-art models struggle to answer the bulk of math problems correctly. For example, a paper published by researchers at the University of California, Berkeley finds that large language models including OpenAI's GPT-3 can only complete 2.9% to 6.9% of problems from a dataset of more than 12,500. Bias is another concern. OpenAI itself notes that its flagship language model, GPT-3, places words like "naughty" or "sucked" near female pronouns and "Islam" near words like "terrorism." A paper by Stanford University Ph.D. candidate and Gradio founder Abubakar Abid detailed the anti-Muslim tendencies of text generated by GPT-3. And the Middlebury Institute of International Studies' Center on Terrorism, Extremism, and Counterterrorism claims that GPT-3 could reliably generate "informational" and "influential" text that might "radicalize individuals into violent far-right extremist ideologies and behaviors."

Among others, leading AI researcher Timnit Gebru has questioned the wisdom of building large language models, examining who benefits from them and who's disadvantaged. A paper coauthored by Gebru earlier this year spotlights the impact of large language models' carbon footprint on marginalized communities and such models' tendency to perpetuate abusive language, hate speech, microaggressions, stereotypes, and other dehumanizing language aimed at specific groups of people.
