DoorDash built a custom AI code review agent that achieves a 60.2% acceptance rate on high and critical findings across 10,000+ weekly PRs. The system evolved through three versions, with the key innovation being a 'lead scout' that identifies suspicious areas before two deep reviewers investigate them — separating noticing from verifying. The architecture uses per-domain review profiles mined from historical PRs, Slack decisions, and incident history rather than generic AGENTS.md files. A precision-over-recall philosophy means the agent posts fewer but higher-quality comments, each anchored to specific lines with evidence. The system also includes a fixer agent that can apply suggested changes directly to PRs via remote VMs. Key engineering lessons include using per-agent soft/hard timeouts to handle stuck agents, measuring cost per successful review rather than token price, and building evals from real past incidents rather than synthetic puzzles.
Table of contents
The numbersHow we got hereThe design principle: precision over recallFocused context beats more contextWhy we built this ourselvesClosing the loop from review to fixWhat it's actually good atEngineering lessons from productionEvals are the development loopWhat's nextSort: