EVA (Evaluation of Voice Agents) is an open-source, end-to-end framework from ServiceNow for evaluating conversational voice agents along two dimensions: EVA-A (Accuracy) and EVA-X (Experience). It uses a bot-to-bot audio architecture with five components: a user simulator, a voice agent, a tool executor, validators, and a metrics suite. The framework evaluates multi-turn spoken conversations and is the first to jointly score task success and conversational experience. Benchmarking 20 cascade and audio-native systems on a 50-scenario airline dataset revealed a consistent accuracy-experience tradeoff: agents that perform well on task completion tend to deliver worse user experiences. Named-entity transcription errors and multi-step workflow failures were the dominant failure modes. The code, dataset, and judge prompts are fully open-sourced.
Table of contents
- Introduction
- Background and Motivation
- EVA
  - The Framework
  - Data
  - Evaluation Methodology
- Findings
- Limitations
- What's Next
- Getting Started
- Acknowledgements
- Citation