Deep Agents is a framework for quickly building AI agents with built-in planning, filesystem tools, and subagent capabilities. This guide demonstrates a systematic evaluation workflow that uses Harbor for sandboxed execution, Terminal Bench 2.0 for benchmarking across 89 real-world tasks, and LangSmith for observability and trace analysis. The workflow establishes a baseline (42.65% on Terminal Bench), identifies optimization opportunities through trace analysis, and enables data-driven improvements such as cutting environment setup latency by pre-populating context information in prompts.
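Before walking through the workflow, it helps to see what a deep agent looks like in code. The sketch below assumes the open-source `deepagents` Python package and its `create_deep_agent` entry point; the exact keyword arguments (e.g. `system_prompt` vs. `instructions`) vary across versions, and the tool shown is purely illustrative, not the setup used in the benchmark runs.

```python
# A minimal sketch, assuming the `deepagents` package's `create_deep_agent`
# entry point. Keyword argument names may differ by package version.
from deepagents import create_deep_agent


def read_file(path: str) -> str:
    """Illustrative custom tool: return the contents of a file."""
    with open(path) as f:
        return f.read()


# create_deep_agent returns a compiled LangGraph agent with planning,
# filesystem tools, and subagent support built in.
agent = create_deep_agent(
    tools=[read_file],
    system_prompt="You are a careful terminal-task assistant.",
)

result = agent.invoke(
    {"messages": [{"role": "user", "content": "List the steps to build this repo."}]}
)
print(result["messages"][-1].content)
```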
Table of contents
- The Problem: How Do You Measure and Optimize?
- Step 1: How Do We Run the Agent?
- Step 2: What Do We Test the Agent On?
- Step 3: How Do We Make It Better?
- Analyzing Traces to Identify Improvements
- Summary
- Resources