The Specula team introduces SysMoBench, a benchmark for evaluating how well LLMs can generate TLA+ formal specifications that accurately model real-world concurrent and distributed systems (such as Etcd, ZooKeeper, and RedisRaft). The benchmark evaluates generated specs across four phases: syntax, runtime, conformance, and invariant checking. Results show that while leading LLMs (Claude, GPT, Gemini, DeepSeek, and others) score near 100% on syntax, they average only ~46% on conformance and ~41% on invariant checking. Two systematic failure modes emerge: specs that admit states the real system never reaches, and specs that cannot reach states the real system always reaches. Both stem from LLMs reciting textbook protocol templates rather than faithfully modeling the actual implementations. The team also introduces Transition Validation, a per-action diagnostic method based on execution traces. They note that agentic tools like Claude Code and Codex perform significantly better, and the team is developing Specula, a specialized TLA+ modeling agent.
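To make the failure modes concrete, here is a minimal sketch of the kind of spec SysMoBench evaluates: a hypothetical two-process mutual-exclusion module with a safety invariant for TLC to check. The module name, actions, and invariant are illustrative assumptions, not taken from the benchmark.

---- MODULE ToyMutex ----
\* Hypothetical example, not from SysMoBench: each process is
\* "idle", "waiting", or "critical".
EXTENDS Naturals

VARIABLE pc

Procs == {1, 2}

Init == pc = [p \in Procs |-> "idle"]

Request(p) == /\ pc[p] = "idle"
              /\ pc' = [pc EXCEPT ![p] = "waiting"]

\* A process may enter only if no other process is critical.
Enter(p) == /\ pc[p] = "waiting"
            /\ \A q \in Procs \ {p} : pc[q] # "critical"
            /\ pc' = [pc EXCEPT ![p] = "critical"]

Exit(p) == /\ pc[p] = "critical"
           /\ pc' = [pc EXCEPT ![p] = "idle"]

Next == \E p \in Procs : Request(p) \/ Enter(p) \/ Exit(p)

Spec == Init /\ [][Next]_pc

\* Safety invariant: at most one process in the critical section.
MutualExclusion ==
    \A p, q \in Procs : pc[p] = "critical" /\ pc[q] = "critical" => p = q
====

The two failure modes map onto this sketch directly: dropping the guard in Enter would let the spec admit states the real lock never reaches, while omitting Exit would prevent the spec from ever returning to states the system always revisits.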

10 min read · From sigops.org
Table of contents
- What is SysMoBench?
- LLM Modeling Patterns
- Transition Validation: Reading Specs at Action Granularity
- Findings: Where the Scores Diverge
- Open Challenges
- What's Next
