A new learning paradigm developed by University College London (UCL) and Huawei Noah’s Ark Lab enables large language model (LLM) agents to dynamically adapt to their environment without fine-tuning the underlying language model. The method allows agents to continuously improve their performance by using a structured memory system that updates itself as the agent gathers experience.

An implementation of the paradigm, which the researchers call Memento, has achieved top scores on key benchmarks for deep research and complex, multi-step reasoning tasks. For enterprises, this offers a scalable and efficient pathway for developing generalist LLM agents that are capable of continuous, real-time learning without the high cost and downtime associated with traditional training methods.

The limitations of current LLM agents

Current LLM agents typically follow one of two development paradigms, each with significant limitations for enterprise applications. The first approach involves building specialized frameworks with fixed, hard-coded workflows. While effective for narrow tasks, these agents are rigid and cannot adapt to new situations or incorporate new information after deployment.

The second paradigm involves updating the LLM itself through supervised fine-tuning or reinforcement learning. This allows for more flexible behavior but comes at a high computational cost and requires extensive data. According to the paper's authors, “These approaches are inefficient for continuous adaptation and online learning, impractical for agents deployed in open-ended scenarios.”

Jun Wang, a professor of computer science at UCL and co-author of the paper, argues that the issue with fine-tuning goes beyond cost. He notes that altering a model’s parameters can “compromise the knowledge acquired during pre-training.” This risk of degrading the model's core capabilities is a key motivation behind their work. 

An ideal LLM agent should be able to update its behavior as it interacts with its environment, without the cost of retraining the underlying model.

A new paradigm: Memory-based learning

Inspired by human memory, the researchers propose a memory-based learning framework that enables continual adaptation without modifying the LLM. Instead of fine-tuning the base model, agents leverage an external memory to store past experiences. When faced with a new task, the agent draws from similar past situations to guide its decision-making. 

This process builds on the Markov decision process (MDP), a classic framework in AI for teaching an agent to make optimal decisions. In a standard MDP, an agent observes the current state of its environment, chooses an action, and receives a reward or penalty. Its goal is to learn a strategy that maximizes its total rewards over time.

Memory-based Markov decision process (M-MDP) (source: arXiv)

The researchers formalize their new approach as a Memory-augmented MDP (M-MDP), which enhances this framework by allowing the agent to consider not just its current state and potential actions, but also a rich memory of past events.
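The shift from a standard MDP to an M-MDP can be sketched in a few lines. Everything below is illustrative (the `Case` record, the word-overlap retrieval, the `explore` fallback are assumptions, not the paper's implementation): the point is that the policy conditions on both the current state and retrieved past cases, and that learning happens by writing to external memory rather than by updating model weights.

```python
from dataclasses import dataclass, field

@dataclass
class Case:
    state: str    # the situation the agent faced
    action: str   # what it did
    reward: float # outcome signal

@dataclass
class MMDPAgent:
    memory: list = field(default_factory=list)  # the external case bank

    def retrieve(self, state: str, k: int = 2) -> list:
        # Toy similarity: shared words between states. A real system
        # would use embeddings.
        def score(case):
            return len(set(state.split()) & set(case.state.split()))
        return sorted(self.memory, key=score, reverse=True)[:k]

    def act(self, state: str) -> str:
        # The policy considers the current state AND retrieved past cases,
        # reusing the action from the best-rewarded similar case.
        cases = self.retrieve(state)
        good = [c for c in cases if c.reward > 0]
        return good[0].action if good else "explore"

    def update(self, state: str, action: str, reward: float) -> None:
        # Learning = appending to memory; the LLM's weights never change.
        self.memory.append(Case(state, action, reward))

agent = MMDPAgent()
agent.update("search flight prices online", "use_web_search", 1.0)
agent.update("summarize a pdf report", "use_file_reader", 1.0)
print(agent.act("compare flight prices for a trip"))  # -> use_web_search
```

Deleting or appending cases changes the agent's behavior immediately, which is what makes this form of learning continual and retraining-free.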

The agent uses a technique called case-based reasoning (CBR), which retrieves and adapts solutions based on its experience with previous problems. For example, a research agent that has successfully completed a web-based task can use that experience to solve a new, structurally similar task it has never seen before. “Our method offers a novel path to continual learning for deep research agents – efficient, generalizable, and inspired by how humans learn,” the researchers write.

How Memento works

The researchers implemented this paradigm in an agent called Memento, designed for deep research tasks that require agents to interact with their environment, use external tools, retrieve information, and process diverse data for dynamic reasoning.

"We call for a new approach that allows for agent adaptability without altering LLM parameters," Wang told VentureBeat. "Memento aims to initiate this revolution."

The system has three main components: a planner and a tool-enabled executor that work in an alternating loop to complete tasks, and a growing "case bank" that stores past experiences.

Memento framework (source: arXiv)

In the planning stage, the planner (powered by an LLM-driven CBR agent) receives a task and queries the case bank for relevant past experiences. The retrieved cases are combined with the current task instruction to form a prompt, which guides the underlying LLM to break the task into subtasks and generate a step-by-step plan. This plan is passed on to the executor, which is powered by a general-purpose LLM.

As the executor works through each subtask, a “subtask memory” module tracks the progress and outcomes. After each step, the planner reviews the execution history to assess if the task is complete. If not, it reformulates the plan based on the updated context. Once the task is finished, the experience is saved to the case bank.
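The alternating loop described above can be sketched as follows. `call_planner` and `call_executor` are hypothetical stand-ins for the two LLM calls (here they return canned subtasks), but the control flow follows the description: plan, execute a subtask, review, replan with updated context or finish, then save the experience to the case bank.

```python
def call_planner(task, cases, history):
    # Stand-in for the CBR planner LLM: returns the remaining subtasks.
    done = {h["subtask"] for h in history}
    steps = ["find sources", "extract facts", "write answer"]
    return [s for s in steps if s not in done]

def call_executor(subtask):
    # Stand-in for the tool-enabled executor LLM.
    return {"subtask": subtask, "result": f"completed: {subtask}"}

def run_agent(task, case_bank, max_rounds=5):
    history = []  # the "subtask memory" tracking progress and outcomes
    for _ in range(max_rounds):
        # Replan each round using the updated execution history.
        plan = call_planner(task, case_bank, history)
        if not plan:  # planner judges the task complete
            break
        history.append(call_executor(plan[0]))
    case_bank.append({"task": task, "trace": history})  # save the experience
    return history

bank = []
trace = run_agent("deep research question", bank)
```

Because the finished trace lands back in the case bank, the next call to the planner can retrieve it, closing the learning loop.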

The executor uses the Model Context Protocol (MCP), a standardized interface that allows it to connect with a wide range of external tools flexibly. This includes search engines, web crawlers, and components for processing multimodal information like video, images, and various file formats.
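The value of such a uniform tool interface is easiest to see as a registry that exposes every tool through one call signature. The sketch below is illustrative only; it is not the actual MCP SDK or its API, just the underlying idea that the executor sees a flat catalog of named tools and invokes them all the same way.

```python
class ToolRegistry:
    def __init__(self):
        self._tools = {}

    def register(self, name, fn, description=""):
        self._tools[name] = {"fn": fn, "description": description}

    def list_tools(self):
        # What the executor sees: tool names plus descriptions,
        # analogous to an MCP server's tool listing.
        return {n: t["description"] for n, t in self._tools.items()}

    def call(self, name, **kwargs):
        # One entry point, whether the tool searches the web,
        # crawls pages, or parses files.
        return self._tools[name]["fn"](**kwargs)

registry = ToolRegistry()
registry.register("search", lambda query: f"results for {query}", "web search")
registry.register("read_file", lambda path: f"contents of {path}", "file reader")
print(registry.call("search", query="M-MDP"))  # -> results for M-MDP
```

Adding a new enterprise tool means registering one more entry; the executor's logic does not change.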

The case bank itself is dynamic and comes in two variants. The non-parametric version retrieves cases based on semantic similarity, a method Wang likens to “collaborative filtering or similarity-based learning, where successful cases from the past inform solutions for current situations.”
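A minimal sketch of that similarity-based retrieval: a production case bank would use neural text embeddings, but bag-of-words cosine similarity (an assumption here, not the paper's method) shows the mechanics of ranking past cases against the current task.

```python
import math
from collections import Counter

def embed(text):
    # Toy "embedding": a bag-of-words count vector.
    return Counter(text.lower().split())

def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(case_bank, query, k=1):
    # Rank stored cases by similarity to the new task and return the top k.
    q = embed(query)
    ranked = sorted(case_bank,
                    key=lambda c: cosine(q, embed(c["task"])),
                    reverse=True)
    return ranked[:k]

cases = [
    {"task": "find recent papers on protein folding", "plan": "search arxiv"},
    {"task": "compute quarterly revenue growth", "plan": "load spreadsheet"},
]
print(retrieve(cases, "survey new protein folding results")[0]["plan"])
```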

The more advanced parametric version uses reinforcement learning with a lightweight neural network to address a common real-world challenge: sparse feedback. For tasks where success or failure signals are infrequent, this method helps the feedback “propagate through various stages,” ensuring the agent learns reliably over time. Wang refers to this as a “non-parametric approach in a broader sense,” as it provides “additional space for LLM agents to learn without changing the underlying parameters of the LLM.”
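The parametric variant can be pictured as a small learned scorer that predicts, from reward feedback, which stored case is worth reusing. The feature design and logistic update below are illustrative assumptions, not the paper's architecture; they show how sparse success/failure signals can propagate into a lightweight network's weights while the LLM itself stays frozen.

```python
import math

def features(state, case):
    # Hypothetical features: Jaccard word overlap plus a bias term.
    s, c = set(state.split()), set(case.split())
    return [len(s & c) / max(len(s | c), 1), 1.0]

class CaseScorer:
    def __init__(self, lr=0.5):
        self.w = [0.0, 0.0]
        self.lr = lr

    def q(self, state, case):
        # Predicted probability that reusing this case succeeds.
        z = sum(wi * xi for wi, xi in zip(self.w, features(state, case)))
        return 1.0 / (1.0 + math.exp(-z))

    def update(self, state, case, reward):
        # Gradient step toward the observed reward (1 = success, 0 = failure);
        # each sparse reward nudges the weights used by every later retrieval.
        err = reward - self.q(state, case)
        x = features(state, case)
        self.w = [wi + self.lr * err * xi for wi, xi in zip(self.w, x)]

scorer = CaseScorer()
for _ in range(200):
    # Toy experience: similar cases tend to succeed, dissimilar ones fail.
    scorer.update("web research task", "web research task", 1.0)
    scorer.update("web research task", "unrelated spreadsheet job", 0.0)

best = max(["unrelated spreadsheet job", "web research task"],
           key=lambda c: scorer.q("new web research task", c))
print(best)  # -> web research task
```

Only the scorer's two weights are trained here; that is the sense in which memory, not the model, is parameterized.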

Memento in action

In experiments, the researchers used GPT-4.1 as the backbone planner and other models like o3 and o4-mini to power the executor. Memento demonstrated strong performance across several challenging benchmarks.

  • On the DeepResearcher dataset, which tests real-time web research and multi-hop reasoning, Memento nearly doubled the performance of a chain-of-thought (CoT) with retrieval-augmented generation (RAG) baseline, achieving a 66.6% F1 score.

  • On the GAIA benchmark, which assesses long-horizon planning and tool use, Memento achieved the top rank on the validation set and fourth on the test set, outperforming most existing open-source agent frameworks.

  • On Humanity’s Last Exam (HLE), a test of complex reasoning in specialized domains, Memento ranked second overall, performing close to GPT-5 and better than models like Gemini 2.5 Pro.

  • On SimpleQA, designed to measure factual accuracy and robustness against hallucination, Memento achieved the highest accuracy among all baselines.

Memento framework performance on key benchmarks (source: arXiv)

A new foundation for agent learning

While Memento uses a form of retrieval, Wang emphasizes that its core framework, the M-MDP, represents a significant step beyond standard RAG.

"While retrieval-based approaches or RAG limit learning and generalization, incorporating reinforcement learning enables the parameterization of memory, allowing generalization directly from memory," he explains.

This makes Memento's learning capability “orthogonal to the research on LLM itself.” In other words, the framework is not competing with advances in base models but is designed to leverage them. As LLMs become more powerful, agents built on the M-MDP framework will become even more effective learners. This approach also redefines how teams can build and deploy agents, creating what Wang calls "a new paradigm for prompt engineering and in-context learning" that brings "machine learners and software engineers much closer."

For enterprises, Memento’s approach is significant. It eliminates the need for expensive and time-consuming LLM retraining, allowing agents to learn on the fly. This paradigm can be integrated with existing proprietary or self-hosted open-source models and can be connected to bespoke enterprise tools and internal data sources through its flexible protocol. This enables the development of continuously improving AI systems that are both cost-effective and highly adaptive to specific business needs.

Looking ahead, Wang identifies "data acquisition" as the single biggest bottleneck to creating truly autonomous AI workers. Agents must be able to interact with their environment to receive the feedback necessary to refine their behavior. The next frontier, he suggests, is enabling "active exploration"—the ability for an agent to explore its environment independently, driven by need or even curiosity. With foundational frameworks like Memento in place, the path toward such autonomous systems becomes clearer.