[2604.15597] LLMs Corrupt Your Documents When You Delegate

A research paper introduces DELEGATE-52, a benchmark that simulates long delegated workflows across 52 professional domains to evaluate LLM reliability in document-editing tasks. Testing 19 LLMs reveals that even frontier models (Gemini, Claude, GPT) corrupt an average of 25% of document content during extended interactions. Key findings: agentic tool use does not improve performance, and degradation worsens with document size, interaction length, and the presence of distractor files. The errors are sparse but severe, and they compound silently over time, making current LLMs unreliable for delegated knowledge work.

2 min read · From arxiv.org
