A persistent problem with evaluating agents is how to measure their performance in real-world scenarios.
Although other benchmarks attempt to address this issue, Meta researchers argue that agents need a more realistic evaluation method, one that tests their adaptability in real-life scenarios.
Enter Meta’s new evaluation platform, Agents Research Environment (ARE), and a new benchmark within it called Gaia2. ARE “supports the running of orchestrations, creation of environments, and connection of synthetic or real-world apps for agent development and evaluations.”
Meanwhile, Gaia2, an upgrade of Meta’s earlier Gaia agent benchmark, measures agent performance on the ARE environment.
Meta’s contention is that current testing environments for agents constantly play catch-up with real-world conditions, forcing evaluators to tweak their benchmarks continually.
“We posit that model improvement through experience and deployment in production is bounded by the controllability, diversity, and realism of available environments. First, while the web is a great environment for supporting agent tasks like search, it is constantly evolving, making reproducibility for evaluation and study of complex behaviors challenging, in particular those involving write operations,” Meta’s paper said.
Research environments
The idea behind ARE is that it is an environment built to mirror the real world, one the agent must interact with. Tasks on ARE run asynchronously: actual time passes, and agents deployed in the environment must adapt and work within these constraints.
It has five core foundations:
Apps: stateful API interfaces over data sources; an email app, for example, exposes tools like send_email
Environments: collections of apps, data and rules
Events: anything that happens in the environment
Notifications: messages that inform the agent about events
Scenarios: the initial state and scheduled events of an environment, which can include a verification mechanism
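The five foundations can be sketched as plain data types. The following is a minimal illustrative model only; the class and field names are hypothetical and do not come from the actual ARE codebase:

```python
from dataclasses import dataclass, field
from typing import Callable, Optional

@dataclass
class App:
    """A stateful API interface over a data source, e.g. an email app."""
    name: str
    tools: dict[str, Callable] = field(default_factory=dict)  # e.g. {"send_email": ...}
    state: dict = field(default_factory=dict)

@dataclass
class Event:
    """Anything that happens in the environment, stamped with simulated time."""
    kind: str
    timestamp: float
    payload: dict = field(default_factory=dict)

@dataclass
class Notification:
    """A message informing the agent that an event occurred."""
    event: Event
    message: str

@dataclass
class Environment:
    """A collection of apps, data and rules."""
    apps: dict[str, App]
    rules: list[str] = field(default_factory=list)
    log: list[Event] = field(default_factory=list)

@dataclass
class Scenario:
    """Initial state plus scheduled events, optionally with a verifier."""
    initial_env: Environment
    scheduled_events: list[Event]
    verifier: Optional[Callable[[Environment], bool]] = None
```

In this toy model, a scenario bundles an environment's starting state with the events that will be injected during the run, mirroring the relationships the list above describes.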

An enterprise that wants to evaluate an agent can build its testing scenario on ARE, which Meta offers as an open-source framework on GitHub, including the core simulation engine, example environments and default orchestration. The enterprise can either build its own environment or use a pre-loaded one, then define the apps with which the agents will interact. From there, it sets up scenarios, connects the agents it wants to test, runs its orchestration logic and sets up a verifier.
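That setup flow — environment, apps, scenario, agent, orchestration, verifier — can be sketched as a small harness. All function and variable names below are hypothetical stand-ins, not the real ARE interface:

```python
# Hypothetical evaluation harness illustrating the setup flow described above:
# feed a scenario's events to the agent, record the transcript, then verify.
def run_evaluation(environment, scenario_events, agent, verifier):
    """Drive one scenario and return (passed, transcript)."""
    transcript = []
    for event in scenario_events:              # scenario = initial state + events
        action = agent(environment, event)     # agent reacts to each event
        transcript.append((event, action))
    passed = verifier(environment, transcript) # verifier scores the outcome
    return passed, transcript

# Example run with trivial stand-ins for each component.
env = {"apps": {"email": {"inbox": []}}}       # stand-in environment
events = [{"kind": "email_received"}, {"kind": "deadline_approaching"}]
agent = lambda env, event: f"ack:{event['kind']}"
verifier = lambda env, transcript: len(transcript) == 2

passed, transcript = run_evaluation(env, events, agent, verifier)
```

The key design point the article describes is that the verifier is defined by the evaluator, not the agent, so the same environment and scenario can score any connected agent.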
Gaia2 benchmarks
Critical to the usefulness of ARE is the Gaia2 benchmark. Gaia2 is built on ARE and measures agents’ broader execution capabilities, whereas the original Gaia benchmark tested an agent’s ability to find answers.
It examines how agents perform within ARE and compares how they handle changing conditions, meet deadlines, manage API failures and ask for clarification when instructions are unclear. Gaia2 also supports several protocols, such as Agent2Agent, to evaluate an agent’s ability to collaborate, and it uses an LLM-as-a-judge framework for scoring.
Since ARE evaluations run asynchronously and time continues to pass even when the agent isn’t acting, Gaia2 can measure whether an idle agent responds when a new event arrives.
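That asynchronous behavior can be illustrated with a tiny simulated clock, where scheduled events are delivered at their timestamps and the harness records whether the agent responds (a hedged sketch, not Gaia2’s actual machinery):

```python
import heapq

def simulate(scheduled_events, agent, horizon):
    """Advance a simulated clock; deliver each event at its timestamp and
    record whether the (possibly idle) agent responds to it."""
    responses = []
    queue = list(scheduled_events)  # (timestamp, event_name) pairs
    heapq.heapify(queue)
    while queue:
        clock, event = heapq.heappop(queue)  # time jumps to the next event
        if clock > horizon:
            break                            # past the evaluation window
        responses.append((clock, event, agent(event)))
    return responses

# An agent that has been idle is still woken for a later event.
agent = lambda event: event is not None
responses = simulate([(5.0, "new_message"), (1.0, "reminder")], agent, horizon=10.0)
```

Events are delivered in timestamp order regardless of insertion order, so the harness can check that an agent idle between t=1.0 and t=5.0 still reacts to the later event.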
It tested agents in a mobile environment across 1,120 tasks.
Based on current testing and a post from Hugging Face CEO Clem Delangue, OpenAI’s GPT-5 is currently leading the Gaia2 benchmark.
Already, the Gaia2 benchmark is gaining fans.
Agents and real-life scenarios
Enterprises want to ensure their agents actually work, but that is difficult to verify with static tests that don’t reflect what agents would be doing in production.
Some recently released benchmarks and evaluations also aim to provide simulated real-life environments. Hugging Face’s Yourbench lets enterprises build their own testing environments using real data. MCPEval from Salesforce, meanwhile, evaluates agents against live MCP servers rather than static, pre-defined scenarios. Inclusion Arena from Inclusion AI also measures agentic performance in real-world scenarios.
Gaia2 differs, however, in that it tests adaptability and how an agent handles “noise.” Inclusion Arena, for example, evaluates human preferences and how closely the agent follows instructions, while MCPEval measures agents’ ability to call tools.
ARE and Gaia2 offer enterprises another means of evaluating agentic performance, allowing them to see how robust their agent is when an unexpected event occurs.
