A controlled experiment testing different LLM generation models (GPT-5, Claude 4.5, Gemini 3, etc.) on a RAG documentation system while keeping retrieval constant. Upgrading from a legacy model to modern models improved accuracy from 3/14 to 10/14 questions correct, roughly a 3x gain. However, all models hit the same ceiling.
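The experiment's core idea, holding the retrieval step fixed while swapping only the generation model, can be sketched as a small evaluation harness. Everything here is illustrative: the question set, the `generate` stub, and the containment-based grading are assumptions, not the article's actual setup (a real harness would call each model's API and likely use a stronger grading method, such as an LLM judge).

```python
# Sketch of a generation-only A/B eval: retrieval output is frozen,
# so any score difference is attributable to the generation model.
# All names and data below are hypothetical.

FIXED_CONTEXTS = {
    "q1": "Widgets are configured in widgets.yaml.",
    "q2": "The API rate limit is 100 requests per minute.",
}

GOLD_ANSWERS = {"q1": "widgets.yaml", "q2": "100"}


def generate(model: str, question: str, context: str) -> str:
    # Stand-in for a real LLM call; here it just parrots the context.
    return context


def grade(answer: str, gold: str) -> bool:
    # Simple containment check; production evals often use an LLM judge.
    return gold.lower() in answer.lower()


def run_eval(models):
    # Score each model against the SAME retrieved contexts.
    scores = {}
    for model in models:
        correct = sum(
            grade(generate(model, q, FIXED_CONTEXTS[q]), GOLD_ANSWERS[q])
            for q in FIXED_CONTEXTS
        )
        scores[model] = f"{correct}/{len(FIXED_CONTEXTS)}"
    return scores
```

Because retrieval is constant across runs, a model that scores higher here did so purely through better generation, which is exactly the isolation the experiment relies on.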
Table of contents

- Isolating the Generation Step
- The Models We Tested
- Results: Clear Gains, Same Ceiling
- Persistent Failure Modes
- What This Round Actually Measured
- What This Tells Us
- What’s Next