RAG pipelines can retrieve the correct documents and still return wrong answers when conflicting documents land in the same context window. An extractive QA model silently picks one claim over another because of position bias, language strength, and lexical alignment, and it gives no signal that a conflict existed. A reproducible 220 MB CPU-only experiment demonstrates this across three production-realistic scenarios: financial restatements, policy revisions, and versioned API docs. The fix is a conflict detection layer inserted between retrieval and generation. Two lightweight heuristics, numerical contradiction detection and contradiction signal asymmetry, flag conflicting document pairs. A cluster-aware recency resolution strategy then keeps only the most recent document in each conflict cluster. In Phase 2, all three scenarios are answered correctly with nearly identical confidence scores, which indicates that confidence was never the right signal. Limitations include paraphrased conflicts, non-temporal disputes, and the O(k²) cost of pairwise comparison. The article also surveys recent research (CONFLICTS benchmark, TCR, CLEAR) and gives actionable guidance on logging conflict reports and surfacing uncertainty to users.
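
The detection and resolution steps summarized above can be sketched roughly as follows. This is a minimal illustration, not the article's implementation: the `Doc` type, the `conflicts` heuristic (shared topic words plus differing numbers), the `min_overlap` threshold, and the `resolve` function are all hypothetical names and choices made for this example. Only the numerical heuristic is shown; contradiction signal asymmetry is omitted.

```python
import re
from dataclasses import dataclass
from datetime import date
from itertools import combinations

@dataclass
class Doc:  # hypothetical container for a retrieved document
    doc_id: str
    text: str
    published: date

NUMBER = re.compile(r"\d+(?:\.\d+)?")
WORD = re.compile(r"[a-z]{4,}")

def numbers(text):
    """All numeric literals in the text, as floats."""
    return {float(m) for m in NUMBER.findall(text)}

def conflicts(a, b, min_overlap=2):
    """Assumed heuristic: the two documents share topic words,
    yet each contains a number the other lacks."""
    na, nb = numbers(a.text), numbers(b.text)
    shared = set(WORD.findall(a.text.lower())) & set(WORD.findall(b.text.lower()))
    return bool(na - nb) and bool(nb - na) and len(shared) >= min_overlap

def resolve(docs):
    """Group conflicting documents into clusters (union-find) and keep
    only the most recent document per cluster, preserving input order."""
    parent = list(range(len(docs)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]  # path compression
            i = parent[i]
        return i

    # Pairwise comparison: this loop is the O(k^2) part.
    for i, j in combinations(range(len(docs)), 2):
        if conflicts(docs[i], docs[j]):
            parent[find(i)] = find(j)

    newest = {}
    for i, d in enumerate(docs):
        root = find(i)
        if root not in newest or d.published > newest[root].published:
            newest[root] = d

    kept = {d.doc_id for d in newest.values()}
    return [d for d in docs if d.doc_id in kept]

docs = [
    Doc("a", "Annual revenue for fiscal 2023 was 4.2 million dollars.", date(2024, 1, 10)),
    Doc("b", "Restated: annual revenue for fiscal 2023 was 3.8 million dollars.", date(2024, 3, 5)),
    Doc("c", "The office cafeteria opens at noon.", date(2023, 6, 1)),
]
print([d.doc_id for d in resolve(docs)])  # ['b', 'c']
```

A document that conflicts with nothing forms its own single-member cluster and is always kept, so the filter only removes older members of a genuine conflict cluster. As the summary notes, a purely lexical and numerical check like this would miss paraphrased conflicts.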

18m read time | From towardsdatascience.com
Table of contents
The System Behaved Exactly as Designed. The Answer Was Still Wrong.
What the Experiment Tests
Three Scenarios, Each Drawn from Production
Running the Experiment
Phase 1: What Naive RAG Does
Why the Model Behaves This Way
Building the Conflict Detection Layer
The Resolution Strategy: Cluster-Aware Recency
Phase 2: What Conflict-Aware RAG Does
What the Heuristics Cannot Catch
Where the Research Community Is Taking This
What You Should Actually Do With This
Running the Full Demo
The Takeaway
References
Models Used
Disclosure
