AE Studio built an RL-based theorem-proving workflow on Modal, comparing Group Relative Policy Optimization (GRPO) against Evolution Strategies (ES) for training LLMs to prove math theorems in Lean. The setup required three distinct execution environments: GPU-based proof generation via vLLM, CPU-based Lean verification in isolated sandboxes, and a lightweight orchestrator. Modal's per-function images, .map() fan-out, sandboxes, and volumes handled the infrastructure. A key ES optimization was stateless checkpointing — the full model state is represented as a base model plus a list of (seed, reward) pairs (~200 bytes/iteration), avoiding expensive weight transfers. The run completed in under two days at ~$122, roughly 3.7x more GPU-efficient than less elastic platforms. Early results show ES matching or outperforming GRPO in sample efficiency for theorem proving, though more experimentation is needed.
Table of contents
Motivation: Teaching AI to Prove Math TheoremsSetupThe Training ProcessHow We Built It on ModalKey Performance MetricsResults & What’s NextWhat you can learn from thisReferencesSort: