A controlled experiment compared two Claude Code agents, one with dltHub's AI workbench toolkit (MCP servers, CLAUDE.md rules, skills) and one without, across 12 runs building REST API data pipelines. The base agent leaked credentials in 100% of runs, never read current docs, skipped sampling, made only one edit per run, and often produced non-persistent pipelines. The workbench agent passed all behavioral checks: safe credential handling, doc fetching, sampling before full loads, iterative refinement, and proper pipeline initialization. The workbench costs ~58% more per run ($2.21 vs $1.40), but that premium funds the engineering practices a careful data engineer would follow anyway. The key insight: both agents produce running code, but only the workbench-guided agent follows a process that would pass a senior engineer's code review.
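To make the behavioral checks concrete, here is a minimal sketch of what the workbench-guided behavior looks like in dlt code. This is not the experiment's actual output; the base URL, resource name, secret path, and pipeline names are hypothetical. The base agent's typical failure is noted in the comments: it would hardcode the token as a string literal and skip the sampling step.

```python
import dlt
from dlt.sources.rest_api import rest_api_source

# Hypothetical API. The base agent would paste the token directly into
# this file as a string literal; that is the credential leak counted
# in the experiment.
source = rest_api_source({
    "client": {
        "base_url": "https://api.example.com/v1/",
        "auth": {
            "type": "bearer",
            # Resolved from .dlt/secrets.toml or env vars, never from source code.
            "token": dlt.secrets["sources.example.api_token"],
        },
    },
    "resources": ["items"],
})

# A persistent pipeline: state and the local duckdb file survive the run,
# unlike the throwaway scripts the base agent often produced.
pipeline = dlt.pipeline(
    pipeline_name="example_api",
    destination="duckdb",
    dataset_name="raw_example",
)

# Sample a small slice first to validate auth and schema before a full load.
pipeline.run(source.add_limit(10))
```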
Table of contents
The full table
1. Credentials: 0 / 3 safe → 3 / 3 safe
2. Documentation: 0 / 6 → 6 / 6
3. Sampling: 0 / 3 → 3 / 3
4. Iteration: 1 edit vs 3–4
5. Pipeline persistence: 0 / 3 → 3 / 3
The cost, honestly a good thing
What separates slop from engineering