ProgramBench is a new benchmark that evaluates whether LLM-based software engineering agents can rebuild entire programs from scratch given only a reference executable and its documentation. Unlike existing benchmarks that focus on narrow tasks such as bug fixes or single features, ProgramBench requires agents to make holistic software architecture decisions. The 200 tasks range from compact CLI tools to major projects such as FFmpeg, SQLite, and the PHP interpreter. Behavioral tests are generated via agent-driven fuzzing of the reference executable, without prescribing any implementation structure. Of the 9 language models evaluated, none fully resolved a single task; the best model passed 95% of the tests on only 3% of tasks. A notable finding is that models tend to produce monolithic single-file implementations that diverge significantly from human-written code.
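Since tests are derived from the reference executable's behavior rather than from its source, this is essentially differential testing: an agent fuzzes inputs, and each test replays one input and checks that the candidate matches the reference's observable behavior. A minimal sketch of such a harness, assuming tests compare exit code and stdout (`run_binary`, `behavioral_test`, and the comparison policy are illustrative assumptions, not ProgramBench's actual API):

```python
import subprocess

# Hypothetical harness sketch; not the benchmark's published code.

def run_binary(path, args, stdin_data=b""):
    """Run an executable and capture its externally visible behavior."""
    proc = subprocess.run(
        [path, *args],
        input=stdin_data,
        capture_output=True,
        timeout=10,
    )
    return proc.returncode, proc.stdout

def behavioral_test(reference, candidate, args, stdin_data=b""):
    """Pass iff the rebuilt program matches the reference's observable
    behavior (exit code and stdout) on one fuzzer-generated input."""
    # Only black-box behavior is compared; nothing about the candidate's
    # internal structure or file layout is prescribed.
    return (run_binary(reference, args, stdin_data)
            == run_binary(candidate, args, stdin_data))
```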
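The headline numbers can be read as a per-task pass rate thresholded across tasks: compute the fraction of behavioral tests each rebuilt program passes, then report the share of tasks clearing a bar such as 95%. A sketch of that aggregation (the function name and the fixed threshold are assumptions, not the benchmark's published scoring code):

```python
def share_of_tasks_above(per_task_results, threshold=0.95):
    """Fraction of tasks whose test pass rate meets the threshold.
    per_task_results: list of (tests_passed, tests_total) pairs."""
    rates = [passed / total for passed, total in per_task_results]
    return sum(rate >= threshold for rate in rates) / len(rates)

# Example: clearing the 95% bar on 6 of 200 tasks gives 6/200 = 0.03.
print(share_of_tasks_above([(95, 100)] * 6 + [(40, 100)] * 194))  # 0.03
```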