Better-Harness is a system for iteratively improving AI agent harnesses using evaluations as a learning signal. The approach treats evals like training data in classical ML, using them to guide autonomous harness updates through a loop of sourcing evals, splitting into optimization/holdout sets, running baselines, optimizing, validating, and human review. Key design decisions include tagging evals by behavioral category, using holdout sets to prevent overfitting, and mining production traces for new eval cases. Results on Claude Sonnet 4.6 and GLM-5 showed strong generalization to holdout sets after hill-climbing. The system is open-sourced and integrates with LangSmith for trace logging, production monitoring, and eval generation.
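The loop summarized above (split tagged evals, score a baseline, hill-climb on the optimization set, then check the holdout set) can be sketched roughly as follows. This is a minimal illustration, not the Better-Harness implementation: `EvalCase`, `split_by_category`, `hill_climb`, `validate_on_holdout`, and the `score` callback are hypothetical names introduced here, and a harness is represented by an opaque value such as a string identifier.

```python
import random
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable, Iterable, List, Tuple, TypeVar

H = TypeVar("H")  # stand-in for whatever object represents a harness


@dataclass(frozen=True)
class EvalCase:
    """Hypothetical eval record: an id plus a behavioral category tag."""
    case_id: str
    category: str


# Hypothetical scoring callback: runs a harness on eval cases, returns a score.
ScoreFn = Callable[[H, List[EvalCase]], float]


def split_by_category(
    cases: Iterable[EvalCase], holdout_fraction: float = 0.3, seed: int = 0
) -> Tuple[List[EvalCase], List[EvalCase]]:
    """Split evals into optimization and holdout sets, category by category,
    so that each behavioral category is represented on both sides."""
    rng = random.Random(seed)
    by_category = defaultdict(list)
    for case in cases:
        by_category[case.category].append(case)

    optimization: List[EvalCase] = []
    holdout: List[EvalCase] = []
    for category in sorted(by_category):
        group = list(by_category[category])
        rng.shuffle(group)
        # A single-case category stays entirely in the optimization set.
        n_holdout = max(1, round(len(group) * holdout_fraction)) if len(group) > 1 else 0
        holdout.extend(group[:n_holdout])
        optimization.extend(group[n_holdout:])
    return optimization, holdout


def hill_climb(
    baseline: H, candidates: Iterable[H], score: ScoreFn, optimization: List[EvalCase]
) -> H:
    """Keep a candidate harness only if it beats the best optimization-set score so far."""
    best, best_score = baseline, score(baseline, optimization)
    for candidate in candidates:
        candidate_score = score(candidate, optimization)
        if candidate_score > best_score:
            best, best_score = candidate, candidate_score
    return best


def validate_on_holdout(
    baseline: H, best: H, score: ScoreFn, holdout: List[EvalCase]
) -> bool:
    """Pass only if the optimized harness does not regress on evals it never saw."""
    return score(best, holdout) >= score(baseline, holdout)
```

The holdout cases never influence which candidate is kept, which is what makes the final check a test of generalization rather than of fit. A harness that passes would still go to human review, as the summary describes.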

11m read time · From blog.langchain.com
Table of contents
Evals are training data for Agents
Sourcing good evals
Better-Harness: a recipe for hill climbing your harness
Examples of harness changes
Results from the Better-Harness loop
Evals maintenance & regressions
The Future: automated error detection & fixes
