Testing large language models poses challenges that traditional software testing does not. LLMs are expensive to test thoroughly, their outputs vary from run to run, and regression testing becomes an unbounded problem. Common GenAI demos are unreliable evidence because they show a single run without careful analysis of the output. The LARC (LLM Aggregated Retrieval Consistency) methodology tests self-consistency: it repeatedly prompts a model to retrieve information from a text and checks for contradictions across the runs, which reveals reliability issues even in basic retrieval tasks.
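The repeated-prompting idea can be illustrated with a minimal sketch. This is not the LARC implementation, whose details are not given here. It assumes a hypothetical `ask` callable that wraps a model call, and it treats any difference between normalized answers as a disagreement, which is a much cruder test than real contradiction checking. The names `normalize`, `consistency_report`, and `make_fake_model` are invented for illustration.

```python
from collections import Counter
from itertools import cycle
from typing import Callable, Sequence


def normalize(answer: str) -> str:
    """Lowercase, collapse whitespace, and drop trailing periods so that
    trivial formatting differences do not count as disagreement."""
    return " ".join(answer.lower().split()).rstrip(".")


def consistency_report(ask: Callable[[str], str], prompt: str, runs: int = 5) -> dict:
    """Send the same retrieval prompt `runs` times and summarize agreement.

    `ask` is a hypothetical wrapper around a model call (an assumption,
    not a real library API). Exact match after normalization stands in
    for a proper contradiction check.
    """
    answers = [normalize(ask(prompt)) for _ in range(runs)]
    counts = Counter(answers)
    top_answer, top_count = counts.most_common(1)[0]
    return {
        "answers": counts,                # distinct answers and their frequencies
        "majority_answer": top_answer,    # most frequent normalized answer
        "agreement": top_count / runs,    # share of runs giving the majority answer
        "consistent": len(counts) == 1,   # True only if every run agreed
    }


def make_fake_model(canned: Sequence[str]) -> Callable[[str], str]:
    """Hypothetical stand-in for a real LLM: replays canned answers in order."""
    replies = cycle(canned)
    return lambda prompt: next(replies)
```

In practice `make_fake_model` would be replaced by a function that calls the model under test, and the report would be aggregated over many prompts rather than one. A single run, as in a typical demo, would always look consistent under this check, which is why repetition matters.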