As large language models become more capable, users are tempted to delegate knowledge tasks where models process documents on their behalf and provide the finished results. But how far can you trust the model to stay faithful to the content of your documents when it has to iterate over them across multiple rounds?
A new study by researchers at Microsoft shows that large language models silently corrupt the documents they work on, introducing errors as they iterate. The researchers developed a benchmark that simulates multi-step autonomous workflows across 52 professional domains, using a method that automatically measures how much content degrades over time.
Their findings show that even top-tier frontier models corrupt an average of 25% of document content by the end of these workflows. And providing models with agentic tools or realistic distractor documents actually worsens their performance.
This serves as a warning that while there is increasing pressure to automate knowledge work, current language models are not fully reliable for these tasks.
The mechanics of delegated work
The Microsoft study focuses on “delegated work,” an emerging paradigm where users allow LLMs to complete knowledge tasks on their behalf by analyzing and modifying documents.
A prominent example of this paradigm is vibe coding, where a user delegates software development and code editing to an AI. But delegated workflows extend far beyond programming into other domains. In accounting, for example, a user might supply a dense ledger and instruct the model to split the document into separate files organized by specific expense categories.
Because users might lack the time or the specialized expertise to manually review every modification the AI implements, delegation often hinges on trust. Users expect that the model will faithfully complete tasks without introducing unchecked errors, unauthorized deletions, or hallucinations in the documents.
To measure how far AI systems can be trusted in extended, iterative delegated workflows, the researchers developed the DELEGATE-52 benchmark. The benchmark is composed of 310 work environments spanning 52 diverse professional domains, including financial accounting, software engineering, crystallography, and music notation.

Each work environment relies on real-world seed text documents ranging from 2,000 to 5,000 tokens. Alongside the seed document, the environments include five to ten complex, non-trivial editing tasks.
Grading a complex, multi-step editing process usually requires expensive human review. DELEGATE-52 bypasses this with a “round-trip relay” simulation that evaluates answers without requiring human-annotated reference solutions. The approach is inspired by the backtranslation technique used in machine translation evaluation, where a model translates a document from one language to another and back, and the result is compared against the original to see how faithfully it was reproduced.
Accordingly, every edit task in DELEGATE-52 is designed to be fully reversible, pairing a forward instruction with its precise inverse. For example, an instruction to split the ledger into separate files by expense category is paired with an instruction to merge all category files back into a single ledger.
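In code, one round trip reduces to running the forward edit, running the inverse in a fresh session, and comparing the result to the original. Below is a minimal sketch of that check; the model interface, parser, and similarity function are hypothetical stand-ins, not the paper's actual implementation.

```python
# Minimal sketch of one round-trip check. The model interface, parser,
# and similarity function are hypothetical stand-ins, not the paper's code.

def round_trip_score(model, document, forward, inverse, parse, similarity):
    """Apply a forward edit, then its inverse in a fresh session,
    and score how much of the original content survives (0 to 1)."""
    edited = model.run(instruction=forward, files={"doc.txt": document})
    # Fresh session: the model sees only the edited files and the inverse
    # instruction, with no memory of having performed the forward step.
    restored = model.run(instruction=inverse, files=edited)
    return similarity(parse(document), parse(restored["doc.txt"]))
```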
In comments provided to VentureBeat, Philippe Laban, Senior Researcher at Microsoft Research and co-author of the paper, clarified that this is not simply a test of whether an AI can hit "undo." Because human workers cannot be forced to instantly "forget" a task they just did, this round-trip evaluation is uniquely suited for AI. By starting a new conversational session, the researchers force the model to attempt the inverse task completely independently.
The models in their experiments “do not know whether a task is a forward or backward step and are unaware of the overall experiment design," Laban explained. "They are simply attempting each task as thoroughly as they can at each step."

These round-trip tasks are chained together into a continuous relay that simulates long-horizon workflows spanning 20 consecutive interactions. To make the environment more realistic, the benchmark adds distractor files to the context of each task: 8,000 to 12,000 tokens of topically related but task-irrelevant documents. The distractors measure whether the model can maintain focus or gets confused and pulls in the wrong data.
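Chained together, the relay looks roughly like the loop below, which tracks similarity to the seed document after each round trip. This is again a hypothetical sketch: the 20-interaction schedule and distractor injection follow the paper's description, but every interface and the sampling size are assumptions.

```python
import random

def run_relay(model, seed_doc, task_pairs, distractors, parse, similarity,
              interactions=20):
    """Chain reversible task pairs into a long-horizon relay and record
    how similar the working document stays to the original seed."""
    doc, scores = seed_doc, []
    for i in range(interactions // 2):  # each forward/inverse pair = 2 steps
        forward, inverse = task_pairs[i % len(task_pairs)]
        # Inject topically related but task-irrelevant distractor files.
        noise = dict(random.sample(sorted(distractors.items()), k=3))
        edited = model.run(instruction=forward, files={"doc.txt": doc, **noise})
        restored = model.run(instruction=inverse, files=edited)
        doc = restored["doc.txt"]
        scores.append(similarity(parse(seed_doc), parse(doc)))
    return scores  # a single-round drop of 10% or more marks a critical failure
```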
Testing frontier models in the relay
To understand how different architectures and scales handle delegated work, the researchers tested 19 different language models from OpenAI, Anthropic, Google, Mistral, xAI, and Moonshot. The main experiment subjected these models to a simulation of 20 consecutive editing interactions.
Across all models, documents suffered an average degradation of 50% by the end of the simulation. Even the best frontier models in the experiment, specifically Gemini 3.1 Pro, Claude 4.6 Opus, and GPT 5.4, corrupted an average of 25% of the document content.
Out of the 52 professional domains, Python was the only one in which most models achieved “ready” status with a score of 98% or higher. Models excel at programmatic tasks but struggle severely in natural-language and niche domains such as fiction, earnings statements, and recipes. The overall top model, Gemini 3.1 Pro, was deemed ready for delegated work in only 11 of the 52 domains.

Interestingly, the corruption was not a death by a thousand cuts in which models slowly accumulate tiny errors. Instead, about 80% of total degradation came from sparse but massive critical failures: single interactions in which a model suddenly drops at least 10% of the document's content. Frontier models do not necessarily avoid small errors better; they simply delay these catastrophic failures to later rounds.
Another important observation is that when weaker models fail, their degradation originates primarily from content deletion. However, when frontier models fail, they actively corrupt the existing content. The text is still there, but it has been subtly distorted or hallucinated, making it much harder for a human overseer to detect the error.
Counterintuitively, giving models an agentic harness with generic tools for code execution and file read/write access actually worsened their performance, adding an average of 6% more degradation. Laban explained that the failure lies in relying on generic tools rather than domain-specific ones.
"Models lack the capability to write effective programs on the fly that can manipulate files across diverse domains without mistakes," he noted. "When they cannot do something programmatically, they resort to reading and rewriting entire files, which is less efficient and more error prone." The solution for developers is to build tightly scoped tools (such as specific functions to calculate or move entries within .ledger files) to keep agents on track.
Degradation also snowballs as documents grow larger or as more distractor files are added to the workspace. For enterprise teams investing heavily in retrieval-augmented generation (RAG), the distractor results are a direct warning about the compounding cost of messy context. A noisy context window might cause only a 1% performance drop after two interactions, but that degradation compounds into a 2-8% drop over a long simulation.
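The compounding arithmetic is easy to verify: per-interaction retention multiplies across rounds, so even tiny per-step losses add up over a 20-interaction relay. The figures below are a back-of-the-envelope illustration consistent with that range, not the paper's measurements.

```python
# Back of the envelope: content retained after n interactions is
# (1 - loss_per_interaction) ** n, so small losses compound.
for loss in (0.001, 0.002, 0.004):
    total = 1 - (1 - loss) ** 20
    print(f"{loss:.1%} per interaction -> {total:.1%} lost after 20")
# 0.1% -> ~2.0%, 0.2% -> ~3.9%, 0.4% -> ~7.7% lost over the simulation
```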
"For the retrieval community: RAG pipelines should be evaluated over multi-step workflows, not just single-turn retrieval benchmarks," Laban said. "Single-turn measurements systematically underestimate the harm of imprecise retrieval."
Reality check for the autonomous enterprise
The findings from the DELEGATE-52 benchmark offer a critical reality check for the current hype surrounding fully autonomous AI agents.
The benchmark's design also implies a practical constraint: because models can maintain a clean record for several steps before a sudden catastrophic failure, incremental human review is necessary rather than a single final check. Laban recommends building AI applications around short, transparent tasks rather than complex long-horizon agents.
For organizations wanting to deploy autonomous agents safely today, the DELEGATE-52 methodology provides a practical blueprint for testing in-house data pipelines. Laban explained that "… an enterprise team wanting to adopt this framework needs to build three components: (a) a set of reversible editing tasks representative of their workflows, (b) a parser that converts their domain documents into a structured representation, and (c) a similarity function that compares two parsed representations." Teams do not even need to build parsers from scratch. The Microsoft research team successfully repurposed existing parsing libraries for 30 out of the 52 domains tested.
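In code, those three components map onto a small interface like the one sketched below; this is a hypothetical rendering of the quote's (a), (b), and (c), with illustrative names and signatures.

```python
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class ReversibleTask:
    """(a) A forward instruction paired with its precise inverse."""
    forward: str  # e.g. "Split the ledger into one file per expense category"
    inverse: str  # e.g. "Merge the category files back into a single ledger"

# (b) A parser that converts a domain document into a structured
# representation, e.g. a ledger into (date, category, amount) rows.
Parser = Callable[[str], Any]

# (c) A similarity function over two parsed representations; 1.0 means
# no content was lost or corrupted relative to the original.
Similarity = Callable[[Any, Any], float]
```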
Laban is optimistic about the rate of improvement. "Progress is real and fast. Looking at the GPT family alone, models go from scoring below 20% to around 70% in 18 months," Laban said. "If that trajectory continues, models will soon be able to achieve saturated scores on DELEGATE-52."
However, Laban cautioned that DELEGATE-52 is purposefully small compared to massive enterprise environments. Even as foundation models inevitably master this benchmark, the endless long-tail of unique enterprise data and workflows means organizations will always need to invest in custom, domain-specific tooling to keep their autonomous agents reliable.
