
Robert Youssef @rryssf_
Holy shit. IBM deployed AI agents in production and found that 38% of failures had nothing to do with reasoning. The model knew the answer. It just formatted the output wrong.

JSON parsing errors. Missing fields. Schema violations. A single bad format can cascade through an 8-agent pipeline and kill the entire task.

IBM's CUGA system runs eight specialized agents in sequence (Task Analyzer, API Planner, Plan Controller, Shortlister, and others), each passing outputs to the next. When one agent produces malformed JSON, the downstream agents receive garbage. They don't know the upstream agent knew the answer. They just see a broken input and fail. The cascade propagates silently through the pipeline until the entire task fails.

IBM ran 1,940 LLM calls across three models on 24 production tasks and built a 15-tool validation framework to systematically audit every call. What they found was not a reasoning problem. It was a formatting problem that the field has been treating as a reasoning problem.

The failure modes are specific and recurrent. The API Planner, the agent that generates execution plans, is the single worst offender, generating high rates of schema violations, instruction non-compliance, format errors, missing few-shot coverage, and edge-case gaps simultaneously. Its few-shot examples don't cover partial completions or loops. Its prompts don't handle cases where the planner needs to backtrack. Every task that hits those gaps fails not because the model can't reason about the task, but because nobody anticipated those cases in the prompt. The Task Analyzer, which initiates every trajectory, shows frequent mismatches between what its system prompt requires and what actually gets passed in: a required summary field is simply missing from inputs.

The model-scale finding is the one that should change how teams think about deployment. IBM tested the same agent system with GPT-4o, Llama 4 Maverick 17B, and Mistral Medium. GPT-4o solved 58.3% of tasks. Llama 4 solved 33.3%. Mistral solved 41.7%. Then IBM ran their validation framework, identified the specific formatting failures, and fixed the prompts: standardizing variable names, aligning few-shot examples with actual task logic, adding schema anchoring to the planner. The same fixes applied to all three models.

The results after validation-driven prompt fixes on WebArena:

→ GPT-4o: 47% → 50% pass@3 (modest gain, already near ceiling)
→ Llama 4 Maverick 17B: 38% → 46% pass@3 (+8 percentage points)
→ Mistral Medium: 35% → 42% pass@3 (+7 percentage points)
→ Regression rate across all models: near zero; fixes recovered failures without breaking passing tasks
→ GPT-4o recovered 10 previously failing tasks, regressed 1
→ Llama 4 recovered 12 previously failing tasks, regressed 4
→ Mistral recovered 8 previously failing tasks, regressed 2
→ Parsing errors account for 38% of all observed task failures in production

The gap between frontier and smaller models narrowed substantially from fixing formatting, not from switching models. Llama 4 and Mistral went from 7-25 percentage points behind GPT-4o to within striking distance, using the same weights, the same architecture, the same hardware. The difference was prompt coherence. Schema anchoring. Consistent variable names. Few-shot examples that actually match the task. IBM's framing is direct: dependability in agentic systems can be engineered through disciplined process, not merely through larger models.
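
To make those two fixes concrete, here's a minimal sketch of schema anchoring plus a hard validation check at the agent-to-agent handoff. Everything in it is hypothetical: PLAN_SCHEMA, the field names, and the helper functions are illustrations, not IBM's actual CUGA code, and it assumes the third-party jsonschema package.

```python
import json
from jsonschema import validate, ValidationError

# Hypothetical output schema for a planner agent. Illustrative only;
# not IBM's actual CUGA schema.
PLAN_SCHEMA = {
    "type": "object",
    "properties": {
        "summary": {"type": "string"},
        "steps": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "api": {"type": "string"},
                    "args": {"type": "object"},
                },
                "required": ["api", "args"],
            },
        },
    },
    "required": ["summary", "steps"],
}

def anchored_prompt(task: str) -> str:
    """Schema anchoring: embed the exact output schema in the prompt,
    so the model is told, not left to guess, what shape to emit."""
    return (
        f"Task: {task}\n"
        "Respond with JSON matching EXACTLY this schema "
        "(no prose, no markdown fences):\n"
        f"{json.dumps(PLAN_SCHEMA, indent=2)}"
    )

def validate_handoff(raw_output: str) -> dict:
    """Boundary check between agents: fail loudly right here instead of
    passing malformed output to the next agent in the pipeline."""
    try:
        plan = json.loads(raw_output)
    except json.JSONDecodeError as e:
        raise RuntimeError(f"planner emitted non-JSON output: {e}")
    try:
        validate(instance=plan, schema=PLAN_SCHEMA)
    except ValidationError as e:
        raise RuntimeError(f"planner schema violation: {e.message}")
    return plan

if __name__ == "__main__":
    # A typical failure: the model answered, but dropped a required field.
    bad = '{"steps": [{"api": "search_orders", "args": {"id": 7}}]}'
    try:
        validate_handoff(bad)
    except RuntimeError as e:
        print(e)  # planner schema violation: 'summary' is a required property
```

The point of the handoff check is that a schema violation dies at the boundary where it happened, with a readable error, instead of propagating garbage seven agents downstream.
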
The trace-comparison finding adds a practical tool for debugging. IBM tested two approaches to root-cause analysis: analyzing a single failed trace alone versus comparing a failed trace against a successful trace for the same task. For 46% of failure pairs, the comparison method produced substantially better explanations. For the remaining 54%, the two were equal. The single-trace method never won. When you want to know why Llama 4 failed on a task that GPT-4o solved, the answer is almost always visible in the diff between their execution traces, not in the failed trace alone.
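
The comparison idea is easy to prototype. Below is a bare-bones version using stdlib difflib; the trace format and labels are assumptions, and IBM's 15-tool framework presumably produces richer explanations than a raw text diff, but the core move, diffing the failed trace against a passing one, is the same.

```python
import difflib

def explain_failure(failed_trace: str, passing_trace: str) -> str:
    """Diff a failed execution trace against a passing trace for the
    same task; the first divergence usually localizes the break."""
    return "\n".join(difflib.unified_diff(
        passing_trace.splitlines(),
        failed_trace.splitlines(),
        fromfile="passing_trace",  # e.g. the GPT-4o run that solved the task
        tofile="failed_trace",     # e.g. the Llama 4 run that did not
        lineterm="",
    ))

if __name__ == "__main__":
    passed = "plan: search\ncall: GET /orders\nparse: ok\nanswer: 3 orders"
    failed = "plan: search\ncall: GET /orders\nparse: JSONDecodeError\nretry: gave up"
    print(explain_failure(failed, passed))
```
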
The field has been buying bigger models to fix problems that better prompts would solve. IBM just showed the receipts.