A detailed walkthrough of an 8-stage evaluation framework for AI agents, developed while building an IT self-service agent quickstart on Red Hat OpenShift AI. Covers the progression from manual testing to automated multi-turn conversation evaluation using DeepEval, including custom metrics with LLM-as-judge, conversation generation, known-bad test cases, CI/CD integration, and cost tracking. Key insights include the need for capable evaluator models, the importance of testing your metrics against known failures, and practical token cost estimates for running evaluations at scale.
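To ground the summary, here is a minimal sketch of what a DeepEval LLM-as-judge metric looks like in practice. The rubric text, sample conversation turn, and evaluator model choice are illustrative assumptions, not the quickstart's actual configuration:

```python
from deepeval import evaluate
from deepeval.metrics import GEval
from deepeval.test_case import LLMTestCase, LLMTestCaseParams

# GEval builds an LLM-as-judge metric from a natural-language rubric.
helpfulness = GEval(
    name="Helpfulness",
    criteria="Does the agent resolve the user's IT request or clearly escalate it?",
    evaluation_params=[LLMTestCaseParams.INPUT, LLMTestCaseParams.ACTUAL_OUTPUT],
    model="gpt-4o",  # evaluator model; the article stresses it must be capable
)

# A hypothetical single exchange from an IT self-service conversation.
test_case = LLMTestCase(
    input="My laptop won't connect to the VPN.",
    actual_output="Let's check your VPN client version first. Which OS are you on?",
)

# Runs the judge model against each test case and reports per-metric results.
evaluate(test_cases=[test_case], metrics=[helpfulness])
```

The article extends this idea to full multi-turn conversations, generated test data, and known-bad cases, as the table of contents below indicates.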

Table of contents
- About AI quickstarts
- Our evaluations journey
- An example conversation
- Manual testing with a few predefined conversations
- Automated evaluation
- Generating conversations
- Known bad conversations
- The complete flow
- Cost
- Wrapping up
- Next steps
- To learn more