Evaluations are the single most reliable indicator of the health and long-term viability of any generative AI project. As a Principal Applied AI Architect for AWS, I've had the opportunity to look at over 100 different attempts at evaluation frameworks over the last few years.
In this talk, I share some stories about the best and worst of them, and then distill the 7 most common elements I've seen in successful evaluations.

Slides at https://genai071123.s3.us-east-1.amazonaws.com/slides/7+Habits+AI+World's+Fair.pptx

AI Engineer

An AWS principal architect shares seven essential habits for effective generative AI evaluations, based on experience scaling workloads across industries. Key practices include building fast evaluation frameworks (with a 30-second target), creating quantifiable metrics backed by numerous test cases, making evaluations explainable by examining the model's reasoning, segmenting complex prompts into steps that can be evaluated individually, ensuring diverse test coverage, and combining traditional evaluation methods with AI-based judging. The talk emphasizes that evaluations are the missing piece for scaling generative AI projects, and includes a customer example in which accuracy improved from 22% to 92% after a proper evaluation framework was put in place.
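To make the "quantifiable metrics with numerous test cases" and "diverse test coverage" practices concrete, here is a minimal sketch in Python. It is not taken from the talk or the slides: the names `EvalCase`, `run_eval`, and `toy_model`, the `category` label, and the exact-match scoring rule are all assumptions chosen for illustration, and a real framework would likely replace the stand-in model with an actual model call and a richer scorer.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Callable


@dataclass
class EvalCase:
    """One test case (hypothetical structure, not from the talk)."""
    prompt: str
    expected: str
    category: str  # assumed label, used to report coverage per case type


def run_eval(model: Callable[[str], str], cases: list[EvalCase]) -> dict[str, float]:
    """Return overall and per-category accuracy using case-insensitive exact match.

    Assumes no category is literally named "overall".
    """
    hits: dict[str, int] = defaultdict(int)
    totals: dict[str, int] = defaultdict(int)
    for case in cases:
        answer = model(case.prompt).strip().lower()
        correct = answer == case.expected.strip().lower()
        # Count each case once overall and once under its own category.
        for key in ("overall", case.category):
            totals[key] += 1
            hits[key] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


def toy_model(prompt: str) -> str:
    """Stand-in for a real model call; answers from a fixed lookup table."""
    answers = {"2+2": "4", "capital of France": "Paris"}
    return answers.get(prompt, "unknown")


cases = [
    EvalCase("2+2", "4", "math"),
    EvalCase("3+3", "6", "math"),
    EvalCase("capital of France", "paris", "geography"),
]

scores = run_eval(toy_model, cases)
for name, accuracy in scores.items():
    print(f"{name}: {accuracy:.0%}")
```

Reporting one number per category alongside the overall score is one simple way to see whether a change helps some kinds of inputs while hurting others.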

7 Habits of Highly Effective Generative AI Evaluations - Justin Muller