An analysis of how analytics agents think when solving text-to-SQL problems, using a 50-question sample from the BIRD-Bench benchmark. Claude Opus 4.5 with the MotherDuck MCP Server was used to generate chain-of-thought traces, which were then classified by a team of Claude sub-agents acting as judges. Key findings: single-shot answers succeed 91% of the time, iterative loops succeed 64% of the time, and struggling agents fail completely. A notable failure case shows the agent confusing semantically similar columns (position vs rank). The post also questions whether semantic layers truly solve these ambiguity problems, suggesting query history as a more adaptive source of context.