Testing large language models presents unique challenges compared to traditional software testing. LLMs are expensive to test thoroughly, behave inconsistently, and create unbounded regression testing problems. Common GenAI demos are unreliable because they show single runs without careful output analysis. The LARC (LLM Aggregated Retrieval Consistency) methodology tests self-consistency by repeatedly prompting models to retrieve information from text and checking for contradictions across multiple runs, revealing reliability issues in basic retrieval tasks.

6m read timeFrom satisfice.com
Post cover image
Table of contents
GenAI Demos Are Nearly WorthlessOne of My Experiments: LARC

Sort: