Better Harness: A Recipe for Harness Hill-Climbing with Evals

Better-Harness is a system for iteratively improving AI agent harnesses using evaluations as a learning signal. The approach treats evals like training data in classical ML, using them to guide autonomous harness updates through a loop of sourcing evals, splitting into optimization/holdout sets, running baselines, optimizing, validating, and human review. Key design decisions include tagging evals by behavioral category, using holdout sets to prevent overfitting, and mining production traces for new eval cases. Results on Claude Sonnet 4.6 and GLM-5 showed strong generalization to holdout sets after hill-climbing. The system is open-sourced and integrates with LangSmith for trace logging, production monitoring, and eval generation.
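The loop described above (split evals, run a baseline, optimize, then validate on held-out cases) can be illustrated with a small sketch. This is not the Better-Harness implementation: `split_by_category`, `pass_rate`, `hill_climb`, the `"tag"` field, and the `run_case` callback are all hypothetical names chosen for illustration, and a real harness would be a prompt/tool configuration rather than the toy value used here. Human review of the winning candidate is assumed to happen outside this code.

```python
import random
from collections import defaultdict


def split_by_category(evals, holdout_fraction=0.3, seed=0):
    """Split evals into optimization and holdout sets, stratified by tag.

    Assumes each eval is a dict with a hypothetical "tag" key naming its
    behavioral category.
    """
    rng = random.Random(seed)
    by_tag = defaultdict(list)
    for case in evals:
        by_tag[case["tag"]].append(case)

    optimization, holdout = [], []
    for tag in sorted(by_tag):
        cases = by_tag[tag][:]
        rng.shuffle(cases)
        # Hold out at least one case per tag when the tag has 2+ cases,
        # so every category is represented in validation.
        if len(cases) > 1:
            n_holdout = max(1, round(len(cases) * holdout_fraction))
        else:
            n_holdout = 0
        holdout.extend(cases[:n_holdout])
        optimization.extend(cases[n_holdout:])
    return optimization, holdout


def pass_rate(harness, cases, run_case):
    """Fraction of cases for which run_case(harness, case) is truthy."""
    if not cases:
        return 0.0
    return sum(1 for case in cases if run_case(harness, case)) / len(cases)


def hill_climb(baseline, candidates, optimization, holdout, run_case):
    """Keep a candidate only if it beats the current best on the
    optimization set, then report the winner's holdout score."""
    best = baseline
    best_score = pass_rate(best, optimization, run_case)
    for candidate in candidates:
        score = pass_rate(candidate, optimization, run_case)
        if score > best_score:
            best, best_score = candidate, score
    # The holdout set is scored once, after selection, never during it.
    return best, best_score, pass_rate(best, holdout, run_case)
```

Because candidates are selected only on the optimization set, a large gap between the optimization score and the holdout score is a sign that the harness change overfit to the cases it was tuned on.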

#ai-agents

#prompt-engineering

#langchain

Apr 08 • 11m read time • From blog.langchain.com

Table of contents

Evals are training data for Agents
Sourcing good evals
Better-Harness: a recipe for hill climbing your harness
Examples of harness changes
Results from the Better-Harness loop
Evals maintenance & regressions
The Future: automated error detection & fixes
