DoorDash engineers built a simulation and evaluation flywheel to test LLM-powered customer support chatbots at scale. The system uses historical support transcripts to generate multi-turn synthetic conversations, with mocked backend APIs for realistic scenarios. An LLM plays the customer role while the production chatbot responds, and an automated LLM-as-judge framework evaluates outcomes across metrics like hallucination rates, tone, and task completion. The flywheel enables rapid iteration on prompts and context before deployment. Context engineering improvements validated through this system reduced hallucination rates by roughly 90% before going live.
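The loop described above (a simulated customer, the chatbot under test, mocked backend APIs, and an automated judge) can be sketched roughly as follows. This is a minimal illustration, not DoorDash's implementation: every name here (`Scenario`, `mock_order_api`, `stub_customer`, `stub_chatbot`, `stub_judge`, `simulate`) is hypothetical, and the deterministic stubs stand in for what would be LLM calls in a real system.

```python
from dataclasses import dataclass
from typing import Dict, List

# Hypothetical sketch: all names are assumptions, and the stub functions
# replace real LLM calls so the example runs without any external service.

@dataclass
class Turn:
    role: str  # "customer" or "agent"
    text: str

@dataclass
class Scenario:
    order: Dict[str, str]  # state served by the mocked backend API
    opening: str           # first customer message, e.g. seeded from a transcript

def mock_order_api(scenario: Scenario) -> Dict[str, str]:
    """Mocked backend: returns canned order state instead of calling production."""
    return dict(scenario.order)

def stub_customer(scenario: Scenario, history: List[Turn]) -> str:
    """Stand-in for the LLM playing the customer; returns 'DONE' once satisfied."""
    if not history:
        return scenario.opening
    if scenario.order["status"] in history[-1].text:
        return "DONE"
    return "Can you check the status of my order?"

def stub_chatbot(scenario: Scenario, history: List[Turn]) -> str:
    """Stand-in for the chatbot under test; grounds its reply in the mocked API."""
    order = mock_order_api(scenario)
    return f"Your order {order['id']} is currently {order['status']}."

def simulate(scenario, customer, chatbot, max_turns: int = 6) -> List[Turn]:
    """Alternate customer and agent turns until the customer is done or turns run out."""
    history: List[Turn] = []
    for _ in range(max_turns):
        message = customer(scenario, history)
        if message == "DONE":
            break
        history.append(Turn("customer", message))
        history.append(Turn("agent", chatbot(scenario, history)))
    return history

def stub_judge(scenario: Scenario, history: List[Turn]) -> Dict[str, bool]:
    """Stand-in for LLM-as-judge: rule checks for task completion and hallucination."""
    agent_text = " ".join(t.text for t in history if t.role == "agent")
    known_statuses = {"delivered", "delayed", "cancelled"}
    claimed = {s for s in known_statuses if s in agent_text}
    return {
        "task_completed": scenario.order["status"] in agent_text,
        # Any status the agent asserted that the mocked backend does not support.
        "hallucinated": bool(claimed - {scenario.order["status"]}),
    }

scenario = Scenario(order={"id": "A123", "status": "delayed"},
                    opening="Where is my order?")
transcript = simulate(scenario, stub_customer, stub_chatbot)
scores = stub_judge(scenario, transcript)
```

Because the chatbot is passed in as a plain function, a prompt or context change can be evaluated by swapping in a different callable and comparing the judge's scores across the same set of scenarios, which is the kind of pre-deployment iteration the flywheel is meant to support.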