Swimm's CTO shares lessons from building a semantic equivalence checker for AI-generated functional specs. An initial attempt using direct LLM comparison achieved 97% accuracy but failed on long specs, where requirements were missed. Breaking the task down into smaller comparisons paradoxically worsened results to 46% accuracy, with the LLM hallucinating matches. The breakthrough came from forcing structured outputs: requiring specific evidence (matching IDs, quoted text) before conclusions, adding explicit verdict categories, and ordering the response so that proof comes before judgment. This approach eliminated hallucinations by making lying harder than telling the truth, demonstrating that LLM reliability depends on designing response structures that require verifiable grounding.
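The evidence-before-verdict structure can be sketched as a response schema in which the citation fields precede the judgment field. The following is a minimal illustration, not Swimm's actual implementation: the field names, verdict categories, and use of pydantic are all assumptions, but the ordering idea is the one the article describes, since an autoregressive model must emit the quoted evidence before it can commit to a conclusion.

```python
# Minimal sketch of an evidence-first response schema (assumed design,
# not Swimm's actual code). Field order matters: the model must emit
# the IDs and verbatim quotes before the verdict, so the judgment is
# conditioned on evidence it has already committed to.
from enum import Enum
from typing import Optional
from pydantic import BaseModel


class Verdict(str, Enum):
    EXACT_MATCH = "exact_match"          # same requirement, same wording
    SEMANTIC_MATCH = "semantic_match"    # same meaning, different wording
    PARTIAL_MATCH = "partial_match"      # overlapping but incomplete coverage
    NO_MATCH = "no_match"                # nothing in the generated spec covers it


class RequirementCheck(BaseModel):
    # Evidence first: the model must cite before it can conclude.
    true_spec_requirement_id: str        # ID of the requirement being checked
    true_spec_quote: str                 # verbatim text from the true spec
    generated_spec_quote: Optional[str]  # verbatim candidate match, or None
    reasoning: str                       # why the quotes do or don't align
    # Verdict last: emitted only after the evidence above is on the page.
    verdict: Verdict


class EquivalenceReport(BaseModel):
    checks: list[RequirementCheck]
    all_requirements_covered: bool
```

A schema like this also makes hallucinated matches cheap to detect downstream: a validator can reject any check whose `generated_spec_quote` does not appear verbatim in the generated spec, turning "the LLM said it found a match" into a claim that can be mechanically verified.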

14m read time · From swimm.io
Table of contents
- The LLM Said It Found a Match. It Was Lying.
- Setting the Stage
- “True Spec” vs “Generated Spec”
- Validating the Semantic Equivalence Checker
- Attempt #1: Direct Comparison
- Attempt #2: Decomposition
- Attempt #3: Adding Structure
- Attempt #4: Forcing Evidence
- Stepping Back: When Did the LLM Lie? What Can It Teach Us?
- The Key Insight: LLMs Don’t Lie When Cornered with Specificity
- The Broader Lesson