EVA (Evaluation of Voice Agents) is an open-source, end-to-end framework from ServiceNow for evaluating conversational voice agents along two dimensions: EVA-A (Accuracy) and EVA-X (Experience). It uses a bot-to-bot audio architecture with five components: a user simulator, a voice agent, a tool executor, validators, and a metrics suite. The framework evaluates multi-turn spoken conversations and is the first to jointly score task success and conversational experience. Benchmarking 20 cascade and audio-native systems on a 50-scenario airline dataset revealed a consistent accuracy-experience tradeoff: agents that are good at task completion tend to deliver worse user experiences. Named-entity transcription errors and multi-step workflow failures were identified as the dominant failure modes. The code, dataset, and judge prompts are fully open-sourced.

From huggingface.co
Table of contents
Introduction
Background and Motivation
EVA
  The Framework
  Data
  Evaluation Methodology
Findings
Limitations
What's Next
Getting Started
Acknowledgements
Citation
