A controlled experiment testing different LLM generation models (GPT-5, Claude 4.5, Gemini 3, etc.) on a RAG documentation system while keeping retrieval constant. Upgrading from a legacy model to modern models improved accuracy from 3/14 to 10/14 questions correct, roughly a 3x boost. However, all models hit the same ceiling.
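The shape of such a controlled experiment can be sketched as follows. This is a toy illustration, not dlt's actual evaluation harness: the function names, the stand-in `generate` logic, and the model labels are all hypothetical, and the per-model scores simply mirror the 3/14 vs. 10/14 figures reported above.

```python
# Toy sketch of the experiment's structure (hypothetical names, not dlt's code):
# retrieval is held constant while only the generation model varies,
# and every model is graded on the same fixed question set.

QUESTIONS = [f"question-{i}" for i in range(14)]

def retrieve(question: str) -> list[str]:
    """Fixed retrieval step: identical context chunks for every model."""
    return [f"doc chunk for {question}"]

def generate(model: str, question: str, context: list[str]) -> bool:
    """Stand-in for an LLM call; returns whether the answer was judged correct.
    The per-model accuracy here just mirrors the article's 3/14 vs 10/14."""
    solvable = {"legacy": 3, "modern": 10}[model]
    return QUESTIONS.index(question) < solvable

def run_eval(model: str) -> int:
    """Score one generation model against the shared questions and retriever."""
    return sum(generate(model, q, retrieve(q)) for q in QUESTIONS)

print(run_eval("legacy"), run_eval("modern"))  # 3 10
```

Because retrieval is the same function call for every model, any score difference is attributable to the generation step alone; that isolation is what makes the comparison controlled.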

4 min read · From dlthub.com
Table of contents
- Isolating the Generation Step
- The Models We Tested
- Results: Clear Gains, Same Ceiling
- Persistent Failure Modes
- What This Round Actually Measured
- What This Tells Us
- What’s Next
