Research shows natural language benchmarks don't measure AI models' general knowledge well

Open-domain question-answering models -- models theoretically capable of responding to novel questions with novel answers -- often simply memorize answers found in the data on which they're trained, depending on the data set. That's the assertion of a team of researchers affiliated with Facebook and the University College London, who in a preprint paper present evidence that 60%-70% of answers given by models tested on open-domain benchmarks are embedded somewhere in the training sets.

Open-domain question-answering has received attention in the AI community for its practical applications, and more recently as a method to analyze language models' grasp of factual knowledge. But a deep understanding of what kinds of questions models can answer remains elusive; unknowns about how questions and answers are distributed in benchmark corpora make it hard to contextualize the results.

In their study, the researchers sought to evaluate the test sets of popular open-domain question-answering data sets including WebQuestions, TriviaQA, and Open Natural Questions. They identified classes of question a model should be able to answer and annotated 1,000 question-answer pairs from each test set for repeated questions in their respective training sets. Then they computed the performance of several models on the benchmarks using open-book (which leverage retrieval from a large corpus of documents) and closed-book approaches (which focus on training large models with no external knowledge).

The three data sets in question aren't much alike, which was the point -- testing across all three guaranteed robustness. WebQuestions contains 3,778 training and 2,032 test question-answer pairs from a search engine, while TriviaQA has 78,785 training and 11,313 test question-answer pairs from free trivia websites. Meanwhile, Open Natural Questions comprises 79,168 training and 3,610 question-answer pairs from a combination of search engines and Wikipedia articles.

The team theorizes open-domain question-answering models should be able to (1) recall the answer to a question seen at training time, (2) answer novel questions at test time and choose an answer from the set of answers seen during training, and (3) answer novel questions that have answers not contained within the training data set. To determine whether the aforementioned benchmarks measure any of these behaviors, the coauthors split the test data in each corpus by whether the answers appeared somewhere in the training sets. Around 58%-71% of test answers were also somewhere in the training data, according to the researchers, demonstrating that the majority of the test data didn't probe for answer generalization.

The team also probed the benchmarks for paraphrased questions in training data, using the set of 1,000 annotated questions. They say that 28%-34% of the questions were paraphrased, the majority being near-duplicates differing only by one or two words. "This result implies that 30% of the test set of these datasets only probe for how well models can simply memorize question-answer pairs seen at training," the coauthors wrote.

The researchers selected several "open book" models -- dense passage retrieval, retrieval-augmented generation, and fusion-in-decoder -- and "closed book" models (Facebook's BART and Google's T5) to test, as well as nearest-neighbor models that store all available answers and classify new answers based on a similarity measure. Results on the benchmark corpora imply that all models memorized questions well, with an untrained nearest-neighbor model answering 20% of the test questions correctly. But they performed poorly on questions that couldn't be memorized from training sets, with a mean absolute performance difference of 63% between repeated and non-repeated data. And when it came to generalization, one model that reliably memorized questions -- T5 -- struggled, achieving only a 22% match score.

"It is clear that performance on these data sets cannot be properly understood by overall question-answer accuracy," the researchers wrote. "We suggest that in future, a greater emphasis be placed on more behavior-driven evaluation rather than pursuing single-number overall accuracy figures."

More