Spotify's background coding agents rely on strong verification loops to make automated code changes reliable at scale. The system layers two kinds of validation: deterministic verifiers that check formatting, building, and testing, and an LLM-based judge that stops agents from going beyond their instructions. This architecture targets three failure modes: the agent fails to generate a PR, the PR fails CI, and the PR passes CI but is functionally incorrect. The verification loops give the agent incremental feedback while keeping the complexity of the underlying checks out of its context window. The judge vetoes about 25% of agent sessions, and agents successfully course-correct in about half of those cases. Future plans include expanding the verifier infrastructure to support more platforms, integrating more deeply with CI/CD, and implementing structured evaluations.
Table of contents
How things fail
Designing for predictability: verification loops
Using LLMs in the verification loops
Keeping the Agent Focused
The Future
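
The summary above describes a loop in which deterministic verifiers and an LLM-based judge gate each agent session and feed short failure summaries back to the agent. The following is a minimal Python sketch of that idea, not Spotify's implementation. Every name in it (`Verdict`, `run_with_verification`, `check_format`, `toy_judge`, `scripted_agent`) is hypothetical, the judge is a keyword stub rather than a real LLM call, and running the judge after the deterministic checks is an assumption made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Optional


@dataclass
class Verdict:
    source: str         # which verifier (or the judge) produced this verdict
    passed: bool
    feedback: str = ""  # short summary for the agent, not raw tool output


# Hypothetical type: a verifier takes a candidate change and returns a Verdict.
Verifier = Callable[[str], Verdict]


def first_failure(change: str, checks: Iterable[Verifier]) -> Optional[Verdict]:
    """Run checks in order and return the first failing verdict, if any."""
    for check in checks:
        verdict = check(change)
        if not verdict.passed:
            return verdict
    return None


def run_with_verification(propose, verifiers, judge, max_attempts=3):
    """Ask the agent for a change until every check passes or attempts run out.

    `propose` receives the previous failure summary (None on the first try)
    and returns a candidate change. Assumption: the judge runs last, after
    the deterministic verifiers.
    """
    feedback = None
    for _ in range(max_attempts):
        change = propose(feedback)
        failure = first_failure(change, [*verifiers, judge])
        if failure is None:
            return change
        # Only a one-line summary goes back to the agent.
        feedback = f"{failure.source}: {failure.feedback}"
    return None  # give up rather than open a PR that failed verification


# Toy stand-ins for a real formatter check and a real LLM judge.
def check_format(change: str) -> Verdict:
    clean = all(line == line.rstrip() for line in change.splitlines())
    return Verdict("format", clean, "" if clean else "trailing whitespace found")


def toy_judge(change: str) -> Verdict:
    in_scope = "unrelated" not in change
    return Verdict(
        "judge", in_scope, "" if in_scope else "change goes beyond the instructions"
    )


def scripted_agent(attempts):
    """Return a propose() that replays canned attempts and records feedback."""
    seen = []
    remaining = iter(attempts)

    def propose(feedback):
        seen.append(feedback)
        return next(remaining)

    return propose, seen
```

With a scripted agent whose first attempt has trailing whitespace and whose second attempt touches unrelated code, the loop returns the third attempt, and the agent sees only the two one-line failure summaries along the way. Returning `None` after the attempt budget is spent mirrors the choice to produce no PR rather than one that did not pass verification.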