Long-running AI agents that operate over hours, days, or weeks represent the next evolution beyond single-session chat-based agents. Three core engineering challenges must be solved: finite context windows, lack of persistent state, and unreliable self-verification. Addy Osmani surveys how Anthropic, Cursor, and Google have converged on similar architectural patterns — separating the model loop (brain) from execution sandboxes (hands) and durable session logs — while differing in surface area and productization. Practical patterns covered include the Ralph loop (a simple bash-based task loop), checkpoint-and-resume, human-in-the-loop delegation, memory-layered context, ambient processing, and fleet orchestration. Key takeaways: define done-conditions before the agent starts, separate evaluator from generator, invest in append-only session logs, and treat context resets as first-class operations. Real limitations remain around cost, security, alignment drift, and verification overhead.
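The Ralph loop mentioned above can be captured in a few lines of bash: rerun the agent against the same task prompt until a pre-defined done-condition appears in the session log, with each pass starting from a fresh context. A minimal sketch follows; `run_agent` is a hypothetical stand-in for a real agent CLI, and the `DONE` marker is an assumed done-condition, not part of any specific tool.

```shell
#!/usr/bin/env bash
# Minimal sketch of a Ralph loop: keep re-invoking the agent with the
# same prompt until a done-condition is met, treating each pass as a
# fresh context. `run_agent` is a hypothetical stub for an agent CLI.
attempt=0
run_agent() {
  attempt=$((attempt + 1))
  # A real invocation would call an agent CLI here; the stub "succeeds"
  # on the third pass by writing the done-marker to the session log.
  [ "$attempt" -ge 3 ] && echo "DONE" >> session.log
}

: > session.log                       # append-only session log, fresh per task
until grep -q "DONE" session.log; do  # done-condition defined before the agent starts
  run_agent                           # each iteration is a context reset
done
echo "finished after $attempt passes"
```

The key property is that the loop, not the model, owns termination: the done-condition is checked externally (here with `grep` on the log), which is what makes context resets a first-class, safe operation.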
Table of contents
- What “long-running” actually means
- Why this matters
- The three walls every long-running agent hits
- The Ralph loop: one of the simpler practitioner versions of long-running agents
- Anthropic: harnesses, then the brain/hands/session split
- Cursor: planners, workers, judges
- Google: long-running agents on the Agent Platform
- Five patterns for long-running agents in production
- So how do you actually build one today?
- There are some real limitations right now
- Where this is going