In this podcast, InfoQ spoke with Elena Samuylova of Evidently AI about best practices for evaluating Large Language Model (LLM) based applications.

Elena Samuylova discusses strategies for evaluating LLM-based applications across the full lifecycle, from initial development through production monitoring. Key topics include implementing automated scoring systems, using an LLM as a judge for classification tasks, evaluating RAG systems by testing the retrieval and generation components separately, designing custom evaluation criteria, using synthetic data for testing, and evaluating agentic workflows. The conversation emphasizes that evaluation is an iterative process that requires domain expertise, well-designed test data, and continuous refinement, rather than reliance on out-of-the-box metrics.

Elena Samuylova on Large Language Model (LLM) Based Application Evaluation and LLM as a Judge

LLM Based Application Evaluation Process [ 07:44 ]

Custom "LLM as a Judge" Solutions [ 11:35 ]

RAG Based Application Evaluation [ 15:47 ]

Context Engineering vs Prompt Engineering [ 16:54 ]

Role of Synthetic Data in LLM Systems Evaluation [ 19:29 ]

Skillsets Required for LLM Application Evaluation [ 22:03 ]

Limitations of LLM Application Evaluation [ 25:48 ]

LLM Application Evaluation Metrics and Benchmarks [ 27:14 ]

Evaluating Agentic AI Applications [ 28:48 ]

Role of Software Development in the Age of AI [ 31:27 ]