AE Studio built an RL-based theorem-proving workflow on Modal, comparing Group Relative Policy Optimization (GRPO) against Evolution Strategies (ES) for training LLMs to prove math theorems in Lean. The setup required three distinct execution environments: GPU-based proof generation via vLLM, CPU-based Lean verification in isolated sandboxes, and a lightweight orchestrator. Modal's per-function images, .map() fan-out, sandboxes, and volumes handled the infrastructure. A key ES optimization was stateless checkpointing — the full model state is represented as a base model plus a list of (seed, reward) pairs (~200 bytes/iteration), avoiding expensive weight transfers. The run completed in under two days at ~$122, roughly 3.7x more GPU-efficient than less elastic platforms. Early results show ES matching or outperforming GRPO in sample efficiency for theorem proving, though more experimentation is needed.

12m read timeFrom modal.com
Post cover image
Table of contents
Motivation: Teaching AI to Prove Math TheoremsSetupThe Training ProcessHow We Built It on ModalKey Performance MetricsResults & What’s NextWhat you can learn from thisReferences

Sort: