Stanford debuts first AI benchmark to help understand LLMs

In the world of artificial intelligence (AI) and machine learning (ML), 2022 has arguably been the year of foundation models, or AI models trained on a massive scale. From GPT-3 to DALL-E, from BLOOM to Imagen: another day, it seems, another large language model (LLM) or text-to-image model. But until now, there have been no AI benchmarks to provide a standardized way to evaluate these models, which have developed at a rapidly accelerating pace over the past couple of years.

LLMs have particularly captivated the AI community, but according to the Center for Research on Foundation Models (CRFM) at the Stanford Institute for Human-Centered AI (HAI), the absence of an evaluation standard has compromised the community’s ability to understand these models, as well as their capabilities and risks.

To that end, today the CRFM announced the Holistic Evaluation of Language Models (HELM), which it says is the first benchmarking project aimed at improving the transparency of language models and the broader category of foundation models.

“Historically, benchmarks have pushed the community to rally around a set of problems that the research community believes are valuable,” Percy Liang, associate professor in computer science at Stanford University and director of the CRFM, told VentureBeat. “One of the challenges with language models, and foundation models in general, is that they're multipurpose, which makes benchmarking extremely difficult.”

HELM, he explained, takes a holistic approach to the problem, evaluating language models on three principles: recognizing the limitations of models, measuring multiple metrics, and comparing models directly, with a goal of transparency. The core metrics HELM uses for model evaluation include accuracy, calibration, robustness, fairness, bias, toxicity, and efficiency, which point to the key elements that make a model sufficient.

Liang and his team evaluated 30 language models from 12 organizations: AI21 Labs, Anthropic, BigScience, Cohere, EleutherAI, Google, Meta, Microsoft, NVIDIA, OpenAI, Tsinghua University and Yandex. Some of these models are open source, others are available through commercial APIs, and others are private.

A 'comprehensive approach' to LLM evaluation

“I applaud the Stanford group's initiative,” Eric Horvitz, chief scientific officer at Microsoft, told VentureBeat by email. “They have taken a comprehensive approach to evaluating language models by creating a taxonomy of scenarios and measuring multiple aspects of performance across them.”

Benchmarking neural language models is crucial for directing innovation and progress in both industry and academia, he added.

“Evaluation is essential for advancing the science and engineering of neural models, as well as for assessing their strengths and limitations,” he said. “We conduct rigorous benchmarking on our models at Microsoft, and we welcome the Stanford team's comparative evaluation within their holistic framework, which further enriches our knowledge and insights.”

Stanford’s AI benchmark lays foundation for LLM standards

Liang says HELM lays the foundation for a new set of industry standards and will be maintained and updated as an ongoing community effort.

“It’s a living benchmark that is not going to be done, there are things that we’re missing and that we need to cover as a community,” he said. “This is really a dynamic process, so part of the challenge will be to maintain this benchmark over time.”

Many of the choices and ideas in HELM can serve as a basis for further discussion and improvement, agreed Horvitz.

“Moving forward, I hope to see a community-wide process for refining and expanding the ideas and methods introduced by the Stanford team,” he said. “There’s an opportunity to involve stakeholders from academia, industry, civil society, and government—and to extend the evaluation to new scenarios, such as interactive AI applications, where we seek to measure how well AI can empower people at work and in their daily lives.”

AI benchmarking project is a 'dynamic' process

Liang emphasized that the benchmarking project is a “dynamic” process. “When I tell you about the results, tomorrow they could change because new models are possibly coming out,” he said.

One of the main things that the benchmark seeks to do, he added, is capture the differences between the models. When this reporter suggested it seemed a bit like a Consumer Reports analysis of different car models, he said that “is actually a great analogy — it is trying to provide consumers or users or the public in general with information about the various products, in this case models.”

What is unique here, he added, is the pace of change. “Instead of being a year, it might be a month before things change,” he said, pointing to Galactica, Meta’s newly released language model for scientific papers, as an example.

“This is something that will add to our benchmark,” he said. “So it’s like having Toyota putting out a new model every month instead of every year.”

Another difference, of course, is the fact that LLMs are poorly understood and have such a “vast surface area of use cases,” as opposed to a car that is only driven. In addition, the automobile industry has a variety of standards — something that the CRFM is trying to build. “But we’re still very early in this process,” Liang said.

HELM AI benchmark is a 'Herculean' task

“I commend Percy and his team for taking on this Herculean task,” Yoav Shoham, co-founder at AI21 Labs, told VentureBeat by email. “It’s important that a neutral, scientifically-inclined [organization] undertake it.”

The HELM benchmark should be evergreen, he added, and updated on a regular basis.

“This is for two reasons,” he said. “One of the challenges is that it’s a fast-moving target and in many cases the models tested are out of date. For example, J1-Jumbo v1 is a year-old and J1-Grande v1 is 6-months-old, and both have newer versions that haven’t been ready for testing by a third-party.”

Deciding what to test models for is also notoriously difficult, he added. “General considerations such as perplexity (which is objectively defined) or bias (which has a subjective component) are certainly relevant, but the set of yardsticks will also evolve, as we understand better what actually matters in practice,” he said. “I expect future versions of the document to refine and expand these measurements.”

Shoham sent one parting note to Liang about the HELM benchmark: “Percy, no good deed goes unpunished,” he joked. “You’re stuck with it.”
