GenAI is reshaping the product landscape, creating huge opportunities (along with new expectations) for product managers. Yet while prompt engineering and model tuning get the spotlight, one critical skill is often overlooked: rigorous evaluation.

This talk will help PMs move beyond gut-feel “vibe checks” to adopt concrete, repeatable evaluation strategies for LLM-powered products. I'll break down essential eval methodologies, from human feedback and code-based checks to cutting-edge LLM-based evaluations. Drawing on real-world examples, I'll share a practical framework PMs can use to:

- Confidently evaluate AI-driven features
- Ground decisions in real, repeatable data
- Build trust and delight through consistent quality

AI Engineer

An evaluation framework helps AI product managers ship reliable AI applications by systematically testing LLM outputs. The framework involves creating datasets from production traces, running LLM-as-judge evaluations to assess quality metrics like tone and correctness, comparing human labels against automated eval results, and iterating on prompts using A/B testing. Key insight: evaluations should be treated as requirements documentation, with eval datasets serving as acceptance criteria. The approach addresses the non-deterministic nature of LLMs through structured testing workflows that combine automated evaluation with human verification.
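One step in that workflow, comparing human labels against automated eval results, can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the framework presented in the talk: the `Example` record, the four sample traces, and the `keyword_judge` function are all hypothetical, and `keyword_judge` is a deliberately naive stand-in for a real LLM-as-judge call so that the example runs without any model or API.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Example:
    """One row of an eval dataset (hypothetical schema)."""
    trace_id: str
    user_input: str
    model_output: str
    human_label: str  # "pass" or "fail", assigned by a human reviewer


def keyword_judge(example: Example) -> str:
    """Naive stand-in for an LLM-as-judge call (assumption, not a real judge).

    A real judge would send the input and output to a model with a grading
    prompt; here a keyword match plays that role.
    """
    return "pass" if "refund" in example.model_output.lower() else "fail"


def agreement_rate(dataset: List[Example], judge: Callable[[Example], str]) -> float:
    """Fraction of examples where the judge's label matches the human label."""
    if not dataset:
        raise ValueError("dataset must not be empty")
    matches = sum(1 for ex in dataset if judge(ex) == ex.human_label)
    return matches / len(dataset)


def disagreements(dataset: List[Example], judge: Callable[[Example], str]) -> List[str]:
    """Trace IDs where judge and human disagree, for manual review."""
    return [ex.trace_id for ex in dataset if judge(ex) != ex.human_label]


# Made-up traces for illustration only.
DATASET = [
    Example("t1", "Can I get my money back?",
            "Yes, you can request a refund within 30 days.", "pass"),
    Example("t2", "Can I get my money back?", "I am not sure.", "fail"),
    Example("t3", "How do refunds work?",
            "Refund? Figure it out yourself.", "fail"),
    Example("t4", "Is a return possible?",
            "Yes, we will return your payment in full.", "pass"),
]

if __name__ == "__main__":
    print(agreement_rate(DATASET, keyword_judge))  # 0.5
    print(disagreements(DATASET, keyword_judge))   # ['t3', 't4']
```

A low agreement rate points at the judge rather than the product: in this toy data the judge passes a rude answer (`t3`) and fails a correct one (`t4`), which is the kind of mismatch that human verification is meant to surface before the automated eval is trusted.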

Shipping AI That Works: An Evaluation Framework for PMs – Aman Khan, Arize