Researchers find cutting-edge language models fall short in basic reasoning

Even sophisticated language models such as OpenAI's GPT-3 struggle with socially important topics like morality, history, and law. That's the top-line finding from a new paper coauthored by Columbia, University of Chicago, and University of California, Berkeley researchers that proposes a 57-task test to measure models' ability to reason. Models must possess problem-solving abilities and extensive knowledge about the world to perform well on the test. But in experiments, the coauthors found that the models they benchmarked -- including GPT-3 -- frequently didn't know when they were wrong.

The goal of the novel test set is to bridge the gap between the knowledge that models see during training and existing measures of success in natural language processing. Like all machine learning models, language models learn patterns from vast data sets often sourced from Wikipedia, Reddit, ebooks, and other web sources. Some recently introduced benchmarks attempt to capture the linguistic skills of models, but so far, there's little evidence to suggest a correlation between benchmark performance and a model's grasp of commonsense reasoning.

The researchers claim their test is different in that it assesses models across subjects humans commonly learn, like mathematics, history, and ethics. To craft it, graduate and undergraduate students collected 15,908 questions from freely available sources online, including practice exams for undergraduate courses, quizzes for readers of Oxford University Press publications, and tests like the Graduate Record Examination, U.S. Medical Licensing Examination, and Examination for Professional Practice in Psychology. The tasks range in difficulty from an elementary level to an "advanced professional level," a sampling the coauthors argue is sufficient for identifying a model's blind spots.

"We measure arbitrary real-world text understanding," they wrote, noting that each subject contains at least 100 test examples. "Since models are pretrained on the internet, this enables us to test how well they can extract useful knowledge from massive corpora."

In addition to GPT-3, the researchers benchmarked Google's T5 and the Allen Institute for AI's UnifiedQA question-answering model against their test set. The results show that meaningful progress has only become possible in recent months, with models containing up to 13 billion parameters achieving 25% accuracy and 175-billion-parameter models like GPT-3 reaching 43.9% accuracy. (Parameters are parts of the model learned from historical training data.) Even so, GPT-3 failed to excel at any single subject; its performance on the test set was lopsided, with almost 70% accuracy for its best subject (U.S. foreign policy) but "near-random" performance for several other subjects (e.g., college chemistry).
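To make the idea of lopsided, per-subject results concrete, here is a minimal Python sketch of how accuracy could be broken down by subject and compared with a guessing baseline. It is an illustration only, not the researchers' evaluation code. It assumes four answer choices per question, which would put random guessing at about 25%; the article does not state the question format. The function names, the margin, and the toy records are all hypothetical.

```python
from collections import defaultdict

# Assumption: four answer choices per question, so guessing scores about 25%.
CHANCE_ACCURACY = 0.25


def accuracy_by_subject(results):
    """Return {subject: accuracy} from (subject, predicted, correct) tuples."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subject, predicted, correct in results:
        totals[subject] += 1
        hits[subject] += int(predicted == correct)
    return {subject: hits[subject] / totals[subject] for subject in totals}


def near_chance_subjects(per_subject, margin=0.05):
    """List subjects whose accuracy is within `margin` above the guessing baseline, or below it."""
    return sorted(
        subject
        for subject, accuracy in per_subject.items()
        if accuracy <= CHANCE_ACCURACY + margin
    )


if __name__ == "__main__":
    # Toy records invented for illustration; these are not benchmark results.
    toy_results = [
        ("subject_a", "A", "A"),
        ("subject_a", "B", "B"),
        ("subject_a", "C", "C"),
        ("subject_a", "D", "A"),
        ("subject_b", "A", "B"),
        ("subject_b", "B", "B"),
        ("subject_b", "C", "D"),
        ("subject_b", "D", "A"),
    ]
    per_subject = accuracy_by_subject(toy_results)
    for subject, accuracy in sorted(per_subject.items()):
        print(f"{subject}: {accuracy:.0%}")
    print("Near chance:", near_chance_subjects(per_subject))
```

A breakdown like this shows why a single overall score can hide weak areas: a model can look respectable on average while scoring no better than guessing in individual subjects.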

"Overall, GPT-3 does poorly on highly procedural problems," the researchers explained. "It is notably poor at modeling human (dis)approval, as evident by the low performance on the professional law and moral scenarios tasks, [and it] also has difficulty performing calculations, so much so that it exhibits poor performance on elementary mathematics and many other STEM subjects with 'plug and chug' problems ... We speculate that is in part because GPT-3 acquires declarative knowledge more readily than procedural knowledge."

The findings imply that current models have room for improvement, but it's unclear whether existing techniques will suffice. As the researchers point out, previous research indicates that a tenfold increase in model size must be accompanied by an approximately fivefold increase in data, which might be logistically prohibitive.
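The arithmetic behind that scaling concern can be sketched in a few lines of Python. This is a rough illustration, not a formula taken from the paper: it assumes the relationship is a power law (data grows as model size raised to a fixed exponent) and derives the exponent from the two figures quoted above, 10 times the model size and roughly 5 times the data. The function name `data_multiplier` is hypothetical.

```python
import math

# Figures quoted in the article: 10x model size needs roughly 5x data.
MODEL_GROWTH = 10.0
DATA_GROWTH = 5.0

# Assumption: a power law, data_growth = model_growth ** exponent.
# Solving for the exponent from the quoted figures gives about 0.7.
IMPLIED_EXPONENT = math.log(DATA_GROWTH) / math.log(MODEL_GROWTH)


def data_multiplier(model_multiplier, exponent=IMPLIED_EXPONENT):
    """Estimate how much more data a larger model needs under the assumed power law."""
    if model_multiplier <= 0:
        raise ValueError("model_multiplier must be positive")
    return model_multiplier ** exponent


if __name__ == "__main__":
    print(f"Implied exponent: {IMPLIED_EXPONENT:.2f}")
    for growth in (10, 100, 1000):
        print(f"{growth}x parameters -> about {data_multiplier(growth):.0f}x data")
```

Under this assumption the data requirement compounds: every further tenfold jump in parameters multiplies the needed data by about five again, which is why the supply of text on niche subjects could become the limiting factor.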

"Aside from the tremendous expense in creating multi-trillion parameter language models, data may also become a bottleneck," the researchers continued. "There is far less written about esoteric branches of knowledge than about everyday text."
