Microsoft has released STATE-Bench, an open-source benchmark for evaluating AI agent memory systems in enterprise settings. Unlike existing benchmarks that focus on simple conversational retrieval, STATE-Bench is memory-agnostic and measures whether a memory layer actually improves an agent's task performance over repeated runs. It provides three enterprise domains (including airline booking), a simulated environment backed by a real database, deterministic evaluation via database state diffs (avoiding LLM-as-judge where possible), and a user simulator. Users supply only their own learning/memory layer by subclassing a provided base agent, then train on 100 tasks per domain and evaluate on a test set. Results are submitted via GitHub issues to an upcoming leaderboard. The benchmark is also being explored for uses beyond memory, such as system prompt optimization and self-improving agents.
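To make the idea of deterministic evaluation via database state diffs concrete, here is a minimal sketch in Python using the standard-library `sqlite3` module. It is not STATE-Bench's actual implementation: the function names (`snapshot`, `state_diff`, `run_example`), the `bookings` table, and its rows are all hypothetical and chosen only for illustration. The sketch snapshots the database before and after an agent acts, then compares the resulting row-level diff against an expected diff, so no LLM judge is needed to decide pass or fail.

```python
import sqlite3


def snapshot(conn: sqlite3.Connection) -> dict:
    """Return {table_name: list of row tuples} for every table in the database."""
    tables = [
        row[0]
        for row in conn.execute(
            "SELECT name FROM sqlite_master WHERE type='table' ORDER BY name"
        )
    ]
    # Sort rows by their repr so ordering is stable even with NULLs or mixed types.
    return {
        t: sorted(conn.execute(f'SELECT * FROM "{t}"').fetchall(), key=repr)
        for t in tables
    }


def state_diff(before: dict, after: dict) -> dict:
    """For each changed table, list the rows that were added and removed."""
    diff = {}
    for table in sorted(set(before) | set(after)):
        old = set(before.get(table, []))
        new = set(after.get(table, []))
        if old != new:
            diff[table] = {
                "added": sorted(new - old, key=repr),
                "removed": sorted(old - new, key=repr),
            }
    return diff


def run_example() -> bool:
    """Hypothetical airline-booking task: the agent should change one seat."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE bookings (id INTEGER PRIMARY KEY, passenger TEXT, seat TEXT)"
    )
    conn.execute("INSERT INTO bookings VALUES (1, 'Ada', '12A')")
    before = snapshot(conn)

    # Stand-in for the agent's tool call (an assumption, not a real agent).
    conn.execute("UPDATE bookings SET seat = '14C' WHERE id = 1")
    after = snapshot(conn)

    # The task passes only if the observed diff matches the expected diff exactly.
    expected = {
        "bookings": {
            "added": [(1, "Ada", "14C")],
            "removed": [(1, "Ada", "12A")],
        }
    }
    return state_diff(before, after) == expected
```

Because the check compares database contents rather than the agent's wording, two runs that leave the database in the same state receive the same verdict, which is the property that makes repeated-run comparisons of a memory layer meaningful.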