One major challenge in deploying autonomous agents is building systems that can adapt to changes in their environments without the need to retrain the underlying large language models (LLMs).

Memento-Skills, a new framework developed by researchers at multiple universities, addresses this bottleneck by giving agents the ability to develop skills on their own. "It adds its continual learning capability to the existing offering in the current market, such as OpenClaw and Claude Code," Jun Wang, co-author of the paper, told VentureBeat.

Memento-Skills acts as an evolving external memory, allowing the system to progressively improve its capabilities without modifying the underlying model. The framework provides a set of skills that can be updated and expanded as the agent receives feedback from its environment.

For enterprise teams running agents in production, that matters. The alternative — fine-tuning model weights or manually building skills — carries significant operational overhead and data requirements. Memento-Skills sidesteps both.

The challenges of building self-evolving agents

Self-evolving agents are crucial because they overcome the limitations of frozen language models. Once a model is deployed, its parameters remain fixed, restricting it to the knowledge encoded during training and whatever fits in its immediate context window.

Giving the model an external memory scaffolding enables it to improve without the costly and slow process of retraining. However, current approaches to agent adaptation largely rely on manually designed skills to handle new tasks. While some automatic skill-learning methods exist, they mostly produce text-only guides that amount to prompt optimization. Other approaches simply log single-task trajectories that don't transfer across different tasks.

Furthermore, when these agents try to retrieve relevant knowledge for a new task, they typically rely on semantic-similarity routing over standard dense embeddings, and high semantic overlap does not guarantee behavioral utility. An agent relying on standard RAG might retrieve a "password reset" script to solve a "refund processing" query simply because the documents share enterprise terminology.

"Most retrieval-augmented generation (RAG) systems rely on similarity-based retrieval. However, when skills are represented as executable artifacts such as markdown documents or code snippets, similarity alone may not select the most effective skill," Wang said. 

How Memento-Skills stores and updates skills

To solve the limitations of current agentic systems, the researchers built Memento-Skills. The paper describes the system as “a generalist, continually-learnable LLM agent system that functions as an agent-designing agent.” Instead of keeping a passive log of past conversations, Memento-Skills creates a set of skills that act as a persistent, evolving external memory.

Read-Write Reflective Learning (source: arXiv)

These skills are stored as structured markdown files and serve as the agent's evolving knowledge base. Each reusable skill artifact is composed of three core elements. It contains declarative specifications that outline what the skill is and how it should be used. It includes specialized instructions and prompts that guide the language model's reasoning. And it houses the executable code and helper scripts that the agent runs to actually solve the task.
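
The paper describes these artifacts conceptually rather than publishing a file schema, but a minimal Python sketch suggests what one might look like; the `SkillArtifact` class and its field names below are illustrative assumptions, not the framework's actual format.

```python
from dataclasses import dataclass

@dataclass
class SkillArtifact:
    """Illustrative stand-in for one of Memento-Skills' markdown skill files."""
    name: str          # declarative spec: what the skill is
    description: str   # declarative spec: when it should be used
    instructions: str  # specialized prompts that guide the LLM's reasoning
    code: str          # executable helper script the agent runs

    def to_markdown(self) -> str:
        # Serialize into the kind of structured markdown the paper describes.
        sections = [
            f"# Skill: {self.name}",
            f"## When to use\n{self.description}",
            f"## Instructions\n{self.instructions}",
            f"## Code\n{self.code}",
        ]
        return "\n\n".join(sections)
```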

Memento-Skills achieves continual learning through its "Read-Write Reflective Learning" mechanism, which frames memory updates as active policy iteration rather than passive data logging. When faced with a new task, the agent queries a specialized skill router to retrieve the most behaviorally relevant skill — not just the most semantically similar one — and executes it.
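
As a rough illustration of that distinction, the sketch below scores candidates by combining embedding similarity with a learned utility term; the function and the exact scoring rule are assumptions made for illustration, not the paper's router.

```python
import numpy as np

def route_skill(query_embedding, skills, learned_value):
    """Pick a skill by behavioral utility, not by raw semantic similarity alone.

    `skills` are objects carrying an `embedding` attribute; `learned_value`
    maps a skill to a scalar learned from execution feedback. Both are
    hypothetical stand-ins for the paper's RL-trained router.
    """
    def cosine(a, b):
        return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-9))

    # Similarity proposes plausible candidates; learned utility makes the call.
    return max(
        skills,
        key=lambda s: cosine(query_embedding, s.embedding) + learned_value(s),
    )
```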

After the agent executes the skill and receives feedback, the system reflects on the outcome to close the learning loop. Rather than simply appending a log of what happened, the system actively mutates its memory. If the execution fails, an orchestrator evaluates the trace and rewrites the skill artifacts, directly updating the code or prompts to patch the specific failure mode. When no existing skill fits, it authors an entirely new one.
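
A compressed view of that loop, with every component a hypothetical stand-in for the paper's modules, might look like this:

```python
def reflective_step(task, library, router, executor, orchestrator):
    """One read-write cycle: retrieve, execute, reflect, mutate memory."""
    skill = router.select(task, library)  # read: most behaviorally relevant skill
    trace = executor.run(skill, task)     # act: run the skill's code on the task

    if trace.success:
        return trace.result

    # write: on failure, the orchestrator evaluates the trace and rewrites
    # the artifact's code or prompts to patch the specific failure mode...
    patched = orchestrator.rewrite(skill, trace)
    if patched is not None:
        library.update(patched)
    else:
        # ...or, when no existing skill fits, it authors an entirely new one.
        library.add(orchestrator.create_skill(task, trace))
    return trace.result
```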

Memento-Skills also updates the skill router through a one-step offline reinforcement learning process that learns from execution feedback rather than just text overlap. "The true value of a skill lies in how it contributes to the overall agentic workflow and downstream execution," Wang said. "Therefore, reinforcement learning provides a more suitable framework, as it enables the agent to evaluate and select skills based on long-term utility."
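
One way to picture that feedback-driven update is a simple value estimate nudged toward observed rewards. The class below is a deliberate simplification of the paper's one-step offline RL (which trains an embedding-based router); its output could play the role of the `learned_value` term in the routing sketch above.

```python
from collections import defaultdict

class SkillValueTable:
    """Toy value store for skills, updated from execution feedback.

    Illustrative only: a tabular estimate stands in for the paper's
    RL-trained skill router.
    """
    def __init__(self, lr=0.1):
        self.lr = lr
        self.value = defaultdict(float)

    def update(self, replay_log):
        # replay_log: iterable of (skill_name, reward) pairs, where the reward
        # reflects downstream task success rather than textual overlap.
        for skill_name, reward in replay_log:
            old = self.value[skill_name]
            # One-step, bandit-style move toward the observed reward.
            self.value[skill_name] = old + self.lr * (reward - old)
```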

Memento-Skills framework (source: arXiv)

To prevent regression in a production environment, the automated skill mutations are guarded by an automatic unit-test gate. The system generates a synthetic test case, executes it through the updated skill, and checks the results before saving the changes to the global library.
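
In rough Python, the gate might reduce to something like the sketch below; `generate_test` and `run_skill` are hypothetical hooks standing in for the LLM-driven test synthesis and skill execution, not APIs from the released code.

```python
def gated_commit(candidate_skill, generate_test, run_skill, library):
    """Promote a mutated skill only if it passes a synthetic regression test."""
    test_case = generate_test(candidate_skill)  # LLM-synthesized test case
    result = run_skill(candidate_skill, test_case.inputs)

    if test_case.check(result):
        library.update(candidate_skill)  # passed: commit to the global library
        return True
    return False                         # failed: discard, keep the prior version
```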

By continuously rewriting and refining its own executable tools, Memento-Skills enables a frozen language model to build a form of procedural muscle memory and progressively expand its capabilities without a single weight update.

Putting the self-evolving agent to the test

The researchers evaluated Memento-Skills on two rigorous benchmarks. The first is General AI Assistants (GAIA), which requires complex multi-step reasoning, multi-modality handling, web browsing, and tool use. The second is Humanity's Last Exam, or HLE, an expert-level benchmark spanning eight diverse academic subjects like mathematics and biology. The entire system was powered by Gemini-3.1-Flash acting as the underlying frozen language model.

The system was compared against a Read-Write baseline that retrieves skills and collects feedback but doesn’t have self-evolving features. The researchers also tested their custom skill router against standard semantic retrieval baselines, including BM25 and Qwen3 embeddings.

Performance on the GAIA benchmark (Memento-Skills vs Read-Write) (source: arXiv)

The results showed that an actively self-evolving memory vastly outperforms a static skill library. On the highly diverse GAIA benchmark, Memento-Skills improved test set accuracy by 13.7 percentage points over the static baseline, achieving 66.0% compared to 52.3%. On the HLE benchmark, where the domain structure allowed for massive cross-task skill reuse, the system more than doubled the baseline's performance, jumping from 17.9% to 38.7%.

Moreover, the specialized skill router of Memento-Skills avoids the classic retrieval trap where an irrelevant skill is selected simply because of semantic similarity. Experiments show that Memento-Skills boosts end-to-end task success rates to 80%, compared to just 50% for standard BM25 retrieval.

The researchers observed that Memento-Skills achieves this performance through structured, organic skill growth. Both benchmark experiments started with just five atomic seed skills, such as basic web search and terminal operations. On the GAIA benchmark, the agent autonomously expanded this seed set into a compact library of 41 skills to handle the diverse tasks. On the expert-level HLE benchmark, the system dynamically scaled its library to 235 distinct skills.

Memento-Skills starts with a seed of skills (stars) and develops more skills (circles) as it solves tasks (source: arXiv)

Finding the enterprise sweet spot

The researchers have released the code for Memento-Skills on GitHub, where it is available for teams to try.

For enterprise architects, the effectiveness of this system depends on domain alignment. Instead of simply looking at benchmark scores, the core business tradeoff lies in whether your agents are handling isolated tasks or structured workflows.

"Skill transfer depends on the degree of similarity between tasks," Wang said. "First, when tasks are isolated or weakly related, the agent cannot rely on prior experience and must learn through interaction." In such scattershot environments, cross-task transfer is limited. "Second, when tasks share substantial structure, previously acquired skills can be directly reused. Here, learning becomes more efficient because knowledge transfers across tasks, allowing the agent to perform well on new problems with little or no additional interaction."

Given that the system requires recurring task patterns to consolidate knowledge, enterprise leaders need to know exactly where to deploy this today and where to hold off.

"Workflows are likely the most appropriate setting for this approach, as they provide a structured environment in which skills can be composed, evaluated, and improved," Wang said.

However, he cautioned against over-deployment in areas not yet suited for the framework. "Physical agents remain largely unexplored in this context and require further investigation. In addition, tasks with longer horizons may demand more advanced approaches, such as multi-agent LLM systems, to enable coordination, planning, and sustained execution over extended sequences of decisions."

As the industry moves toward agents that autonomously rewrite their own production code, governance and security remain paramount. While Memento-Skills employs foundational safety rails like automatic unit-test gates, a broader framework will likely be needed for enterprise adoption.

"To enable reliable self-improvement, we need a well-designed evaluation or judge system that can assess performance and provide consistent guidance," Wang said. "Rather than allowing unconstrained self-modification, the process should be structured as a guided form of self-development, where feedback steers the agent toward better designs."