The Specula team introduces SysMoBench, a benchmark for evaluating how well LLMs can generate TLA+ formal specifications that accurately model real-world concurrent and distributed systems (such as Etcd, ZooKeeper, and RedisRaft). The benchmark evaluates generated specs across four phases: syntax, runtime, conformance, and invariant checking. Results show that while leading LLMs (Claude, GPT, Gemini, DeepSeek, and others) score near 100% on syntax, they average only ~46% on conformance and ~41% on invariant checking. Two systematic failure modes emerge: specs that admit states the real system never reaches, and specs that cannot reach states the real system always reaches. Both stem from LLMs reciting textbook protocol templates rather than faithfully modeling the actual implementations. The team also introduces Transition Validation, a per-action diagnostic method based on execution traces. They note that agentic tools like Claude Code and Codex perform significantly better, and the team is developing Specula, a specialized TLA+ modeling agent.
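To make the failure modes concrete, here is a minimal sketch of the kind of spec SysMoBench evaluates: a hypothetical two-process mutual-exclusion module with a safety invariant for TLC to check. The module name, actions, and invariant are illustrative assumptions, not taken from the benchmark.

---- MODULE ToyMutex ----
\* Hypothetical example, not from SysMoBench: each process is
\* "idle", "waiting", or "critical".
EXTENDS Naturals

VARIABLE pc

Procs == {1, 2}

Init == pc = [p \in Procs |-> "idle"]

Request(p) == /\ pc[p] = "idle"
              /\ pc' = [pc EXCEPT ![p] = "waiting"]

\* A process may enter only if no other process is critical.
Enter(p) == /\ pc[p] = "waiting"
            /\ \A q \in Procs \ {p} : pc[q] # "critical"
            /\ pc' = [pc EXCEPT ![p] = "critical"]

Exit(p) == /\ pc[p] = "critical"
           /\ pc' = [pc EXCEPT ![p] = "idle"]

Next == \E p \in Procs : Request(p) \/ Enter(p) \/ Exit(p)

Spec == Init /\ [][Next]_pc

\* Safety invariant: at most one process in the critical section.
MutualExclusion ==
    \A p, q \in Procs : pc[p] = "critical" /\ pc[q] = "critical" => p = q
====

The two failure modes map onto this sketch directly: dropping the guard in Enter would let the spec admit states the real lock never reaches, while omitting Exit would prevent the spec from ever returning to states the system always revisits.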

10 min read · From sigops.org
Table of contents
- What is SysMoBench?
- LLM Modeling Patterns
- Transition Validation: Reading Specs at Action Granularity
- Findings: Where the Scores Diverge
- Open Challenges
- What's Next
