Using LLMs to assign numeric scores in evaluations is finicky and unreliable. Small changes to the prompt template, or a switch from one model to another, can produce vastly different results. LLMs are also often inconsistent in their responses, which makes it hard to trust them as arbiters of numeric evaluation criteria.

7m read time From towardsdatascience.com
Table of contents
Why You Should Not Use Numeric Evals For LLM As a Judge
Takeaways
Research
Implications for LLM Evals
Conclusion
