A practitioner with 10+ years of experience benchmarks an LLM-generated Rust reimplementation of SQLite and finds it 20,171x slower on primary key lookups. Two root causes are identified. First, the query planner never checks the `is_ipk` flag, so every WHERE clause does a full table scan instead of a B-tree search. Second, every bare INSERT triggers a full fsync rather than fdatasync. Five compounding performance anti-patterns are also documented: AST clone on cache hit, per-read heap allocation, schema reload on every autocommit, eager formatting in the hot path, and new objects per statement. A second case study shows an 82,000-line Rust disk-cleanup daemon that could be replaced by a one-line cron job. The author ties both failures to LLM sycophancy: models optimize for plausible-looking output that matches the prompt's intent rather than for correctness. The supporting evidence cited includes METR's RCT (AI made experienced developers 19% slower), GitClear's code-quality analysis, the Mercury benchmark (under 50% when efficiency is required), and the Replit database deletion incident. The conclusion: LLMs are productive only when the developer can define measurable acceptance criteria and verify the output independently.
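The planner issue is easier to see in miniature. The following is a minimal sketch, not the code from the benchmarked project: only the `is_ipk` flag name comes from the summary above, while `Column`, `Plan`, `plan_eq_lookup`, and `execute` are hypothetical names, and a `BTreeMap` stands in for the table's B-tree. It assumes an equality predicate on a single column.

```rust
use std::collections::BTreeMap;

/// Hypothetical column metadata. `is_ipk` marks an INTEGER PRIMARY KEY
/// column, i.e. one that aliases the rowid the table B-tree is keyed on.
struct Column {
    name: &'static str,
    is_ipk: bool,
}

/// The two access paths considered for `WHERE col = value`.
#[derive(Debug, PartialEq)]
enum Plan {
    /// Seek directly to the key in the B-tree: O(log n).
    RowidSeek(i64),
    /// Visit every row and test the predicate: O(n).
    FullScan(i64),
}

/// Choose an access path. Skipping the `is_ipk` check here is the kind of
/// omission the summary describes: every lookup would become a FullScan.
fn plan_eq_lookup(columns: &[Column], where_col: &str, value: i64) -> Plan {
    match columns.iter().find(|c| c.name == where_col) {
        Some(c) if c.is_ipk => Plan::RowidSeek(value),
        _ => Plan::FullScan(value),
    }
}

/// Run a plan against a stand-in table (rowid -> row payload) and return
/// the matching row plus the number of rows visited.
/// Simplification: the scan path compares the key itself, which models a
/// primary key lookup that was wrongly planned as a scan.
fn execute(plan: &Plan, table: &BTreeMap<i64, String>) -> (Option<String>, usize) {
    match plan {
        Plan::RowidSeek(id) => (table.get(id).cloned(), 1),
        Plan::FullScan(id) => {
            let mut visited = 0;
            let mut hit = None;
            for (rowid, row) in table {
                visited += 1;
                if rowid == id {
                    hit = Some(row.clone());
                }
            }
            (hit, visited)
        }
    }
}
```

Both plans return the same row, which is why a correctness-only test suite would not catch the difference. Only the visited-row count, or a benchmark, exposes it.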

22m read time
From blog.katanaquant.com
Table of contents
LLMs Lie. Numbers Don’t.
What the Planner Gets Wrong
The Compound Effect
Same Method, Same Result
Intent vs. Correctness
Evidence Beyond Case Studies
What Competent Looks Like
Measure What Matters
Sources