InfoQ

DoorDash engineers built a simulation and evaluation flywheel to test LLM-powered customer support chatbots at scale. The system uses historical support transcripts to generate multi-turn synthetic conversations, with mocked backend APIs for realistic scenarios. An LLM plays the customer role while the production chatbot responds, and an automated LLM-as-judge framework evaluates outcomes across metrics like hallucination rates, tone, and task completion. The flywheel enables rapid iteration on prompts and context before deployment. Context engineering improvements validated through this system reduced hallucination rates by roughly 90% before going live.
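The loop described above (a simulated customer, the chatbot under test, mocked backend data, and an automated judge) can be sketched in a few lines of Python. This is a minimal illustration, not DoorDash's implementation: every name here (`simulate_conversation`, `scripted_customer`, `stub_chatbot`, `stub_judge`, `MOCK_ORDERS`) is hypothetical, and the LLM calls are replaced with deterministic stubs so the example runs without any model or network access.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Turn:
    role: str  # "customer" or "bot"
    text: str


Transcript = List[Turn]
Agent = Callable[[Transcript], str]

# Hypothetical mocked backend: stands in for real order APIs during simulation.
MOCK_ORDERS = {"A1": {"status": "delivered"}}


def mock_order_api(order_id: str) -> Dict[str, str]:
    return MOCK_ORDERS.get(order_id, {"status": "unknown"})


def simulate_conversation(customer: Agent, chatbot: Agent, max_turns: int = 4) -> Transcript:
    """Alternate customer and bot turns until the customer signals it is done."""
    transcript: Transcript = []
    for _ in range(max_turns):
        message = customer(transcript)
        if message == "DONE":
            break
        transcript.append(Turn("customer", message))
        transcript.append(Turn("bot", chatbot(transcript)))
    return transcript


def scripted_customer(transcript: Transcript) -> str:
    """Stub for the LLM that plays the customer; replays a fixed script."""
    script = ["Where is my order A1?", "Thanks, that's all."]
    sent = sum(1 for t in transcript if t.role == "customer")
    return script[sent] if sent < len(script) else "DONE"


def stub_chatbot(transcript: Transcript) -> str:
    """Stub for the chatbot under test; it consults the mocked backend."""
    if "A1" in transcript[-1].text:
        return f"Order A1 is {mock_order_api('A1')['status']}."
    return "Happy to help!"


def stub_judge(transcript: Transcript) -> Dict[str, bool]:
    """Stub for an LLM-as-judge: compares bot claims with the mocked ground truth."""
    bot_text = " ".join(t.text for t in transcript if t.role == "bot").lower()
    truth = mock_order_api("A1")["status"]
    claimed = [s for s in ("delivered", "cancelled", "in transit") if s in bot_text]
    return {
        "task_completed": truth in claimed,
        "hallucination": any(s != truth for s in claimed),
    }


transcript = simulate_conversation(scripted_customer, stub_chatbot)
scores = stub_judge(transcript)
print(scores)
```

In a real setup, the scripted customer and the judge would each be backed by an LLM call, and the scores would be aggregated over many simulated conversations so that a prompt or context change can be compared against a baseline before deployment.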

DoorDash Builds LLM Conversation Simulator to Test Customer Support Chatbots at Scale