A New Framework for Evaluation of Voice Agents (EVA)

EVA (Evaluation of Voice Agents) is an open-source, end-to-end framework from ServiceNow for evaluating conversational voice agents along two dimensions: EVA-A (Accuracy) and EVA-X (Experience). It uses a bot-to-bot audio architecture with five components: a user simulator, a voice agent, a tool executor, validators, and a metrics suite. The framework evaluates multi-turn spoken conversations and is the first to jointly score task success and conversational experience. Benchmarking 20 cascade and audio-native systems on a 50-scenario airline dataset revealed a consistent accuracy-experience tradeoff: agents that do well at task completion tend to deliver worse user experiences. Named-entity transcription errors and multi-step workflow failures were the dominant failure modes. The code, dataset, and judge prompts are all open-sourced.

#conversational-ai

Mar 24 • 11m read time • From huggingface.co

Table of contents

Introduction
Background and Motivation
EVA
  The Framework
  Data
  Evaluation Methodology
Findings
Limitations
What's Next
Getting Started
Acknowledgements
Citation
