ProgramBench is a new benchmark that evaluates whether LLM-based software engineering agents can rebuild entire programs from scratch given only a reference executable and its documentation. Unlike existing benchmarks that focus on narrow tasks such as bug fixes or single features, ProgramBench requires agents to make holistic software architecture decisions. The 200 tasks range from compact CLI tools to major projects such as FFmpeg, SQLite, and the PHP interpreter. Behavioral tests are generated via agent-driven fuzzing of the reference executable, without prescribing any implementation structure. Of the 9 language models evaluated, none fully resolved a single task; the best model passed 95% of the tests on only 3% of tasks. A notable finding is that models tend to produce monolithic single-file implementations that diverge significantly from human-written code.
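Since tests are derived from the reference executable's behavior rather than from its source, this is essentially differential testing: an agent fuzzes inputs, and each test replays one input and checks that the candidate matches the reference's observable behavior. A minimal sketch of such a harness, assuming tests compare exit code and stdout (`run_binary`, `behavioral_test`, and the comparison policy are illustrative assumptions, not ProgramBench's actual API):

```python
import subprocess

# Hypothetical harness sketch; not the benchmark's published code.

def run_binary(path, args, stdin_data=b""):
    """Run an executable and capture its externally visible behavior."""
    proc = subprocess.run(
        [path, *args],
        input=stdin_data,
        capture_output=True,
        timeout=10,
    )
    return proc.returncode, proc.stdout

def behavioral_test(reference, candidate, args, stdin_data=b""):
    """Pass iff the rebuilt program matches the reference's observable
    behavior (exit code and stdout) on one fuzzer-generated input."""
    # Only black-box behavior is compared; nothing about the candidate's
    # internal structure or file layout is prescribed.
    return (run_binary(reference, args, stdin_data)
            == run_binary(candidate, args, stdin_data))
```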
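The headline numbers can be read as a per-task pass rate thresholded across tasks: compute the fraction of behavioral tests each rebuilt program passes, then report the share of tasks clearing a bar such as 95%. A sketch of that aggregation (the function name and the fixed threshold are assumptions, not the benchmark's published scoring code):

```python
def share_of_tasks_above(per_task_results, threshold=0.95):
    """Fraction of tasks whose test pass rate meets the threshold.
    per_task_results: list of (tests_passed, tests_total) pairs."""
    rates = [passed / total for passed, total in per_task_results]
    return sum(rate >= threshold for rate in rates) / len(rates)

# Example: clearing the 95% bar on 6 of 200 tasks gives 6/200 = 0.03.
print(share_of_tasks_above([(95, 100)] * 6 + [(40, 100)] * 194))  # 0.03
```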