LangChain moved its coding agent from the Top 30 to the Top 5 on Terminal Bench 2.0 by changing only the harness, not the model. The team analyzed traces to identify failure patterns, then iterated on the system using three strategies: build-verify loops with self-verification, environmental context injected via middleware, and doom-loop detection. They emphasize context engineering, aggressive testing prompts, and traces as a feedback signal. The result was a 13.7-point improvement (52.8% to 66.5%) using GPT-5.2-Codex with optimized reasoning budgets and middleware hooks.
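To make doom-loop detection concrete, here is a minimal Python sketch. It is not LangChain's implementation: the summary does not say how their harness detects loops, so the `DoomLoopDetector` class, its `window` and `threshold` values, and the idea of comparing tool calls by name and arguments are all assumptions made for illustration.

```python
from collections import deque
from dataclasses import dataclass, field


@dataclass
class DoomLoopDetector:
    """Hypothetical helper: flags an agent that keeps repeating one tool call."""

    window: int = 6      # number of recent calls to remember (assumed value)
    threshold: int = 3   # identical calls in the window that count as a loop (assumed value)
    _recent: deque = field(default_factory=deque, init=False)

    def record(self, tool_name: str, arguments: str) -> bool:
        """Record one tool call and return True if it looks like a doom loop."""
        call = (tool_name, arguments)
        self._recent.append(call)
        # Keep only the most recent `window` calls.
        if len(self._recent) > self.window:
            self._recent.popleft()
        # A loop is suspected when the same call fills `threshold` slots.
        return self._recent.count(call) >= self.threshold


# Example: the agent edits the same file three times within a short span.
detector = DoomLoopDetector()
calls = [
    ("edit_file", "main.py"),
    ("run_tests", ""),
    ("edit_file", "main.py"),
    ("edit_file", "main.py"),
]
flags = [detector.record(name, args) for name, args in calls]
# flags == [False, False, False, True]
```

In a harness, a `True` result might trigger a middleware hook that injects a message asking the agent to reconsider its approach. What the hook actually does is a design choice this sketch leaves open.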
Table of contents
The Goal of Harness Engineering
Experiment Setup & The Knobs on a Harness
What Actually Improved Agent Performance
Practical Takeaways for Building Agent Harnesses