In addition to generating text for a growing number of industry applications, LLMs are now widely used as evaluation tools. Models quantify the relevance of retrieved documents in retrieval…

Towards Data Science

Using LLMs to produce numeric evaluations is finicky and unreliable. Small changes to the prompt template, or a switch from one model to another, can lead to vastly different scores. LLMs are also often inconsistent in their responses, which makes it hard to trust them as arbiters of numeric evaluation criteria.

Why You Should Not Use Numeric Evals For LLM As a Judge