Rethinking AI benchmarks: A new paper challenges the status quo of evaluating artificial intelligence

In recent years, artificial intelligence (AI) has made remarkable progress in performing complex tasks that were once considered the domain of human intelligence. From passing the bar exam and acing the SAT to mastering language proficiency and diagnosing medical images, AI systems such as GPT-4 and PaLM 2 have surpassed human performance on various benchmarks.

Benchmarks are essentially standardized tests that measure the performance of AI systems on specific tasks and goals. They're widely used by researchers and developers to compare and improve different models and algorithms; however, a new paper published in Science challenges the validity and usefulness of many existing benchmarks for evaluating AI systems.

The paper argues that benchmarks often fail to capture the real capabilities and limitations of AI systems, and can lead to false or misleading conclusions about their safety and reliability. For example, benchmarks may not account for how AI systems handle uncertainty, ambiguity, or adversarial inputs. They may also not reflect how AI systems interact with humans or other systems in complex and dynamic environments.

This makes it hard to make informed decisions about where these systems are safe to use. And given the growing pressure on enterprises to use advanced AI systems in their products, the community needs to rethink its approach to evaluating new models.

The problem with aggregate metrics

To develop AI systems that are safe and fair, researchers and developers must make sure they understand what a system is capable of and where it fails.

“To build that understanding, we need a research culture that is serious about both robustness and transparency,” Ryan Burnell, AI researcher at the University of Cambridge and lead author of the paper, told VentureBeat. “But we think the research culture has been lacking on both fronts at the moment.”

One of the key problems that Burnell and his co-authors point out is the use of aggregate metrics that summarize an AI system’s overall performance on a category of tasks such as math, reasoning or image classification. Aggregate metrics are convenient because of their simplicity. But that convenience comes at the cost of transparency: a single number leaves out the nuances of how the AI system performs on critical tasks.

“If you have data from dozens of tasks and maybe thousands of individual instances of each task, it’s not always easy to interpret and communicate those data. Aggregate metrics allow you to communicate the results in a simple, intuitive way that readers, reviewers or — as we’re seeing now — customers can quickly understand,” Burnell said. “The problem is that this simplification can hide really important patterns in the data that could indicate potential biases, safety concerns, or just help us learn more about how the system works, because we can’t tell where a system is failing.”

There are many ways aggregate benchmarks can go wrong. For example, a model might have acceptable overall performance on an aggregate benchmark but perform poorly on a subset of tasks. A study of commercial facial recognition systems found that models with very high overall accuracy performed poorly on darker-skinned faces. In other cases, the model might learn the wrong patterns, such as detecting objects based on their backgrounds, watermarks or other artifacts that are not related to the main task. Large language models (LLMs) can make things even more complicated.

“With large language models becoming more and more general-purpose, this problem is getting worse because the range of capabilities we need to evaluate is getting broader,” Burnell said. “This means that when we aggregate all the data, we’re combining apples and oranges in a way that doesn’t make sense.”
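To make the point concrete, here is a minimal sketch of how a single aggregate score can mask a weak subset. The data, the subset labels and the function names are all invented for illustration; they do not come from the paper or from any real benchmark.

```python
from collections import defaultdict

# Hypothetical instance-level results: (subset label, was the model correct?).
# Labels and counts are made up purely to illustrate the effect.
results = (
    [("subset_a", True)] * 95 + [("subset_a", False)] * 5
    + [("subset_b", True)] * 6 + [("subset_b", False)] * 4
)


def aggregate_accuracy(records):
    """Fraction of all instances answered correctly, ignoring subsets."""
    return sum(correct for _, correct in records) / len(records)


def accuracy_by_subset(records):
    """Accuracy computed separately for each subset label."""
    totals = defaultdict(int)
    hits = defaultdict(int)
    for subset, correct in records:
        totals[subset] += 1
        hits[subset] += correct  # True counts as 1, False as 0
    return {subset: hits[subset] / totals[subset] for subset in totals}


overall = aggregate_accuracy(results)
per_subset = accuracy_by_subset(results)

print(f"overall: {overall:.3f}")
for subset, acc in sorted(per_subset.items()):
    print(f"{subset}: {acc:.3f}")
```

In this made-up data the overall accuracy is about 92%, while the smaller subset sits at 60%. The aggregate number alone gives no hint of that gap.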

According to several studies, LLMs that perform well on complicated tasks can fail badly at much simpler ones. A model might solve a complicated math problem, for example, but give a wrong answer when the same problem is posed in a different way. Other studies show that the same models fail at elementary problems that a human would need to master before learning more complex tasks.

“The broader problem here is that we could become overconfident in the capabilities of our systems and deploy them in situations where they aren’t safe or reliable,” Burnell said.

For example, one of the highly advertised achievements of the GPT-4 technical report is the model’s ability to pass a simulated bar exam and score in the top 10% of the test takers. However, the report does not provide any details on which questions or tasks the model failed at.

“If those tasks are highly important or come up frequently, we might not want to trust the system in such a high-stakes context,” Burnell said. “I’m not saying that ChatGPT can’t be useful in legal contexts, but just knowing that it scores 90th percentile on the bar exam is insufficient to make informed decisions about this issue.”

Granular data can improve AI evaluation

Another problem that Burnell and his co-authors highlight in their paper is the lack of instance-by-instance evaluation reporting. Without access to granular data on the examples used to test the model, it will be very difficult for independent researchers to verify or corroborate the results published in papers.

“Evaluation transparency is really important from an accountability perspective … it’s really important that the community has a way of independently scrutinizing evaluation results to examine the robustness of systems and check for any failure points or biases,” Burnell said. “But making evaluation results public also provides a lot of value from a scientific perspective.”

However, access to instance-by-instance evaluation data is becoming increasingly hard to obtain. According to one study, only a small percentage of papers presented at top AI conferences provide granular access to test instances and results. And evaluating cutting-edge systems like ChatGPT and GPT-4 is becoming prohibitively expensive and time-consuming because of the costs of inference and the number of test examples needed.

Without this data, other researchers and policymakers are forced to either make considerable investments to perform their own tests, or take the reported results at face value. If researchers made their evaluation data available to others, on the other hand, a lot of unnecessary costs could be saved. And with a growing number of platforms making it possible to upload evaluation results, it has become easier and much less costly to publish research data.

“Especially when it comes to the standardized benchmarks that are commonplace in AI, there are many different ways evaluation results could be used that the researchers conducting the initial evaluation might not think of,” Burnell said. “If the data are made public, other researchers can easily conduct supplemental analyses without having to waste time and money on recreating the evaluation.”
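As a rough illustration of what instance-by-instance reporting could look like in practice, the sketch below writes one record per test instance in JSON Lines form and then reloads the records for a follow-up analysis, without querying the model again. The field names (`instance_id`, `task`, `prediction`, `target`) and the helper functions are assumptions made for this example, not a schema proposed in the paper.

```python
import io
import json

# Hypothetical per-instance records; the field names are illustrative assumptions.
records = [
    {"instance_id": "q1", "task": "arithmetic", "prediction": "4", "target": "4"},
    {"instance_id": "q2", "task": "arithmetic", "prediction": "7", "target": "9"},
    {"instance_id": "q3", "task": "reading", "prediction": "B", "target": "B"},
]


def dump_jsonl(rows, fh):
    """Write one JSON object per line to an open text file handle."""
    for row in rows:
        fh.write(json.dumps(row) + "\n")


def load_jsonl(fh):
    """Read records back from a JSON Lines file handle, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]


def failures(rows, task=None):
    """IDs of instances the model got wrong, optionally limited to one task."""
    return [
        row["instance_id"]
        for row in rows
        if row["prediction"] != row["target"]
        and (task is None or row["task"] == task)
    ]


# An in-memory buffer stands in for a published results file.
buffer = io.StringIO()
dump_jsonl(records, buffer)
buffer.seek(0)

# A second researcher reloads the results and runs a supplemental analysis.
reloaded = load_jsonl(buffer)
failed = failures(reloaded)
print("failed instances:", failed)
```

Because every prediction is stored next to its target, anyone with the file can ask new questions of the results, such as which tasks the failures cluster in, at essentially no inference cost.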

Where is the field headed?

Burnell and his co-authors provide several guidelines to help researchers better understand and evaluate AI systems. Best practices include publishing granular performance reports with breakdowns across features of the problem space. The community should also work on new benchmarks that can test specific capabilities instead of aggregating several skills into a single measure. And researchers should be more transparent in recording their tests and making them available to the community.

“In general, the academic community is moving in the right direction — for example, conferences and journals are starting to recommend or require the uploading of code and data alongside submitted papers,” Burnell said.

Burnell noted that some companies such as Hugging Face and Meta are “working hard to stay in line with the best practices recommended by the wider community,” such as open-sourcing data and models and releasing model cards that explain how a model was trained.

But at the same time, the commercial AI market is moving toward less sharing and transparency.

“We have companies like OpenAI who are starting to monetize the use of their models and are essentially switching from conducting scientific research to doing product development,” Burnell said. “These companies clearly believe that in order to keep their competitive edge they need to keep the details of how their models are built and trained secret. And honestly, I don’t think they are wrong about that.”

However, Burnell also warns that this new culture will incentivize companies to sweep the limitations and failures of their models under the rug and cherry-pick evaluation results that make it seem like their models are incredibly capable and reliable.

“Given how popular these models are becoming and the incredibly broad range of things they could be used for, I think that’s potentially a very dangerous situation for us to be in, and I’m concerned about our ability to properly understand the capabilities and limitations of these systems,” Burnell said. “I think we need to push hard to make sure independent groups can get access to these systems in order to properly evaluate them, and that regulatory solutions are probably an important piece of the puzzle here.”
