A comprehensive FAQ covering practical AI evaluation strategies, including when to use RAG vs. alternatives, building custom annotation tools, implementing binary evaluations over rating scales, debugging multi-turn conversations, and creating effective synthetic test data. The guide emphasizes error analysis as the foundation of effective evaluation.
Table of contents
Q: Is RAG dead?
Q: Can I use the same model for both the main task and evaluation?
Q: How much time should I spend on model selection?
Q: Should I build a custom annotation tool or use something off-the-shelf?
Q: Why do you recommend binary (pass/fail) evaluations instead of 1-5 ratings (Likert scales)?
Q: How do I debug multi-turn conversation traces?
Q: Should I build automated evaluators for every failure mode I find?
Q: How many people should annotate my LLM outputs?
Q: What gaps in eval tooling should I be prepared to fill myself?
Q: What is the best approach for generating synthetic data?
Q: How do I approach evaluation when my system handles diverse user queries?
Q: How do I choose the right chunk size for my document processing tasks?
Q: How should I approach evaluating my RAG system?
Q: What makes a good custom interface for reviewing LLM outputs?
Q: How much of my development budget should I allocate to evals?
Q: Why is “error analysis” so important in LLM evals, and how is it performed?
Q: What’s the difference between guardrails & evaluators?
Q: What’s a minimum viable evaluation setup?
Q: How do I evaluate agentic workflows?
Q: Seriously Hamel. Stop the bullshit. What’s your favorite eval vendor?
Q: How are evaluations used differently in CI/CD vs. monitoring production?