A controlled experiment testing different LLM generation models (GPT-5, Claude 4.5, Gemini 3, etc.) on a RAG documentation system while keeping retrieval constant. Upgrading from a legacy model to modern models improved accuracy from 3/14 to 10/14 questions correct, roughly a 3x gain. However, all models hit the same ceiling.
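The experiment's core idea, holding the retrieval step fixed while swapping only the generation model, can be sketched as a small evaluation harness. Everything here is illustrative: the question set, the `generate` stub, and the containment-based grading are assumptions, not the article's actual setup (a real harness would call each model's API and likely use a stronger grading method, such as an LLM judge).

```python
# Sketch of a generation-only A/B eval: retrieval output is frozen,
# so any score difference is attributable to the generation model.
# All names and data below are hypothetical.

FIXED_CONTEXTS = {
    "q1": "Widgets are configured in widgets.yaml.",
    "q2": "The API rate limit is 100 requests per minute.",
}

GOLD_ANSWERS = {"q1": "widgets.yaml", "q2": "100"}


def generate(model: str, question: str, context: str) -> str:
    # Stand-in for a real LLM call; here it just parrots the context.
    return context


def grade(answer: str, gold: str) -> bool:
    # Simple containment check; production evals often use an LLM judge.
    return gold.lower() in answer.lower()


def run_eval(models):
    # Score each model against the SAME retrieved contexts.
    scores = {}
    for model in models:
        correct = sum(
            grade(generate(model, q, FIXED_CONTEXTS[q]), GOLD_ANSWERS[q])
            for q in FIXED_CONTEXTS
        )
        scores[model] = f"{correct}/{len(FIXED_CONTEXTS)}"
    return scores
```

Because retrieval is constant across runs, a model that scores higher here did so purely through better generation, which is exactly the isolation the experiment relies on.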
Table of contents

- Isolating the Generation Step
- The Models We Tested
- Results: Clear Gains, Same Ceiling
- Persistent Failure Modes
- What This Round Actually Measured
- What This Tells Us
- What’s Next