Why exams intended for humans might not be good benchmarks for LLMs like GPT-4

As tech companies continue to roll out large language models (LLMs) with impressive results, measuring their real capabilities is becoming more difficult. According to a technical report released by OpenAI, GPT-4 scores high on bar exams, SAT math tests, and reading and writing exams.

However, tests designed for humans may not be good benchmarks for measuring LLMs’ capabilities. Language models encode knowledge in intricate ways, sometimes producing results that match or exceed average human performance. But the way they obtain and use that knowledge often differs from the way humans do, which can lead us to draw the wrong conclusions from test results.

For LLMs like GPT-4, exam success lies in the training data

Arvind Narayanan, computer science professor at Princeton University, and Sayash Kapoor, Ph.D. candidate at Princeton University, recently wrote an article on the problems with testing LLMs on professional licensing exams.

One of these problems is “training data contamination.” This happens when a trained model is tested on data it was trained on. With too much training, a model might memorize its training examples and perform very well on them, giving the impression that it has learned the task. But it will fail on new examples.

Machine learning engineers go to great pains to separate their training and testing data. But with LLMs, things become tricky because the training corpus is so large that it’s hard to make sure your test examples are not somehow included in the training data.
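One simple way to look for this kind of overlap is to check how many word sequences in a test item also appear verbatim in the training text. The sketch below is only an illustration under stated assumptions: the function names, the whitespace tokenization and the n-gram length are choices made for this example, not a method described by the researchers quoted here.

```python
def ngrams(text, n):
    """Return the set of n-word sequences in a text (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_rate(test_item, corpus_docs, n=8):
    """Fraction of the test item's n-grams that appear verbatim in the corpus.

    A value near 1.0 suggests the test item (or something very close to it)
    is present in the training text. The name and the default n are
    hypothetical choices for this sketch.
    """
    test_grams = ngrams(test_item, n)
    if not test_grams:
        return 0.0  # test item is shorter than n words
    corpus_grams = set()
    for doc in corpus_docs:
        corpus_grams |= ngrams(doc, n)
    return len(test_grams & corpus_grams) / len(test_grams)


# Toy corpus and test items, made up purely for illustration.
corpus = ["the quick brown fox jumps over the lazy dog near the river bank"]
seen = "the quick brown fox jumps over the lazy dog"
unseen = "a completely different sentence about exams and language models"

print(overlap_rate(seen, corpus, n=4))
print(overlap_rate(unseen, corpus, n=4))
```

An exact-match check like this only catches verbatim copies. It would miss paraphrased or lightly edited versions of a test question, and it requires access to the training text in the first place.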

“Language models are trained on essentially all of the text on the internet, so even if the exact test data isn't in the training corpus, there will be something very close to it,” Narayanan told VentureBeat. “So when we find that an LLM performs well on an exam or a programming challenge, it isn't clear how much of that performance is because of memorization versus reasoning.”

For example, one experiment showed that GPT-4 performed very well on Codeforces programming challenges created before 2021, when its training data was gathered. Its performance dropped dramatically on more recent problems. Narayanan found that in some cases, when GPT-4 was provided the title of a Codeforces problem, it could produce the link to the contest where it appeared.
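The before-and-after comparison in that experiment can be sketched as a split of results by problem date. Everything in the example below is an assumption made for illustration: the function name, the cutoff date and the toy results are invented and are not the data from the Codeforces experiment.

```python
from datetime import date


def accuracy_by_cutoff(results, cutoff):
    """Split (problem_date, solved) pairs at a cutoff date and return
    (accuracy_before, accuracy_after). A side with no problems yields None.
    """
    before = [solved for d, solved in results if d < cutoff]
    after = [solved for d, solved in results if d >= cutoff]

    def rate(outcomes):
        return sum(outcomes) / len(outcomes) if outcomes else None

    return rate(before), rate(after)


# Made-up toy results: (date the problem was published, whether the model solved it).
toy_results = [
    (date(2020, 5, 1), True),
    (date(2020, 8, 1), True),
    (date(2021, 6, 1), False),
    (date(2022, 1, 1), True),
]

# Hypothetical cutoff chosen for this sketch.
acc_before, acc_after = accuracy_by_cutoff(toy_results, date(2021, 1, 1))
print(acc_before, acc_after)
```

A large gap between the two numbers is consistent with memorization of older problems, though it does not prove it, since newer problems could also differ in difficulty.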

In another experiment, computer scientist Melanie Mitchell examined ChatGPT’s performance on MBA tests, a result that had been widely covered in the media. Mitchell found that the model’s performance on the same problem could vary substantially when the prompt was phrased in slightly different ways.

“LLMs have ingested far more text than is possible for a human; in some sense, they have ‘memorized’ (in a compressed format) huge swaths of the web, of Wikipedia, of book corpora, etc.,” Mitchell told VentureBeat. “When they are given a question from an exam, they can bring to bear all the text they have memorized in this form, and can find the most similar patterns of ‘reasoning’ that can then be adapted to solve the question. This works well in some cases but not in others. This is in part why some forms of LLM prompts work very well while others don’t.”

Humans solve problems in a different way

Humans gradually build their skills and knowledge in layers through years of experience, study and training. Exams designed for humans assume that the test-taker already possesses these preparatory skills and knowledge, and therefore do not test them thoroughly. On the other hand, language models have proven that they can shortcut their way to answers without the need to acquire prerequisite skills.

“Humans are presumably solving these problems in a different, more generalizable way. Thus we can’t make the assumptions for LLMs that we make for humans when we give them tests,” Mitchell said.

For instance, part of the background knowledge for zoology is that each individual is born, lives for a while and dies, and that the length of life is partly a function of species and partly a matter of the chances and vicissitudes of life, says computer scientist and New York University professor Ernest Davis.

“A biology test is not going to ask that, because it can be assumed that all the students know it, and it may not ask any questions that actually require that knowledge. But you had better understand that if [you’re going to be] running a biology lab or a barnyard,” Davis told VentureBeat. “The problem is that there is background knowledge that is actually needed to understand a particular subject. This generally isn't tested on tests designed for humans because it can pretty well be assumed that people know [it].”

The lack of these foundational skills and knowledge is evident in other instances, such as an examination of large language models in mathematics that Davis carried out recently. Davis found that LLMs fail at very elementary math problems posed in natural language. Yet other experiments, including the technical report on GPT-4, show that LLMs score high on advanced math exams.

How far can you trust LLMs in professional tasks?

Mitchell, who further tested LLMs on bar exams and medical school exams, concludes that exams designed for humans are not a reliable way to figure out these AI models’ abilities and limitations for real-world tasks.

“This is not to say that enormous statistical models like LLMs could never reason like humans — I don’t know whether this is true or not, and answering it would require a lot of insight into how LLMs do what they do, and how scaling them up affects their internal mechanisms,” Mitchell said. “This is insight which we don’t have at present.”

What we do know is that such systems make hard-to-predict, non-humanlike errors, and “we have to be very careful when assuming that they can generalize in ways that humans can,” Mitchell said.

Narayanan said that an LLM that aces exams through memorization and shallow reasoning might be good for some applications, but can't do the range of things a professional can do. This is especially true for bar exams, which overemphasize subject matter knowledge and underemphasize real-world skills that are hard to measure in a standardized, computer-administered way.

“We shouldn't read too much into exam performance unless there is evidence that it translates into an ability to do real-world tasks,” Narayanan said. “Ideally we should study professionals who use LLMs to do their jobs. For now, I think LLMs are much more likely to augment professionals than replace them.”
