A controlled experiment compared two Claude Code agents, one with dltHub's AI workbench toolkit (MCP servers, CLAUDE.md rules, skills) and one without, across 12 runs building REST API data pipelines. The base agent leaked credentials in 100% of runs, never read current docs, skipped sampling, made only one edit per run, and often produced non-persistent pipelines. The workbench agent passed all behavioral checks: safe credential handling, doc fetching, sampling before full loads, iterative refinement, and proper pipeline initialization. The workbench costs ~58% more per run ($2.21 vs $1.40), but that premium funds the engineering practices a careful data engineer would follow anyway. The key insight: both agents produce running code, but only the workbench-guided agent follows a process that would pass a senior engineer's code review.
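To make the behavioral checks concrete, here is a minimal sketch of what the workbench-guided behavior looks like in dlt code. This is not the experiment's actual output; the base URL, resource name, secret path, and pipeline names are hypothetical. The base agent's typical failure is noted in the comments: it would hardcode the token as a string literal and skip the sampling step.

```python
import dlt
from dlt.sources.rest_api import rest_api_source

# Hypothetical API. The base agent would paste the token directly into
# this file as a string literal; that is the credential leak counted
# in the experiment.
source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com/v1/",
        "auth": {
            "type": "bearer",
            # Resolved from .dlt/secrets.toml or env vars, never from source code.
            "token": dlt.secrets["sources.example.api_token"],
        },
    },
    "resources": ["items"],
})

# A persistent pipeline: state and the local duckdb file survive the run,
# unlike the throwaway scripts the base agent often produced.
pipeline = dlt.pipeline(
    pipeline_name="example_api",
    destination="duckdb",
    dataset_name="raw_example",
)

# Sample a small slice first to validate auth and schema before a full load.
pipeline.run(source.add_limit(10))
```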
Table of contents
The full table
1. Credentials: 0 / 3 safe → 3 / 3 safe
2. Documentation: 0 / 6 → 6 / 6
3. Sampling: 0 / 3 → 3 / 3
4. Iteration: 1 edit vs 3–4
5. Pipeline persistence: 0 / 3 → 3 / 3
The cost, honestly a good thing
What separates slop from engineering