A developer applies Karpathy's Autoresearch framework to an old ML research project (eCLIP), using Claude Code as the autonomous agent. The setup is a constrained optimization loop in which the agent iteratively modifies training code, runs experiments, and commits or reverts each change based on eval metrics. Over 42 experiments in a single day, the agent reduced mean rank from 344.68 to 157.43, a 54% reduction (lower is better). The biggest win came from the agent spotting a bug in the temperature parameter clamping. Gains diminished significantly in later phases, which involved architectural changes and moonshot ideas. Key takeaways: sandboxing is essential; the commit-or-revert loop works well for well-defined search spaces; and LLM agents struggle with 'unknown unknowns' in research.
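The commit-or-revert loop described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the author's actual harness: `commit_or_revert`, `run_experiment`, the change names, and the toy numbers are all hypothetical, and a real setup would apply each change to the training code, run training and eval, and then use `git commit` or a revert instead of updating a list.

```python
from typing import Callable, Dict, List, Tuple


def commit_or_revert(
    baseline: float,
    candidates: List[str],
    run_experiment: Callable[[List[str]], float],
) -> Tuple[float, List[str]]:
    """Greedy loop: keep a change only if it lowers the metric (e.g. mean rank)."""
    best = baseline
    kept: List[str] = []
    for change in candidates:
        # Evaluate the candidate on top of everything committed so far.
        score = run_experiment(kept + [change])
        if score < best:
            # "Commit": the change becomes part of the new baseline.
            best = score
            kept.append(change)
        # Otherwise "revert": drop the change and continue from `best`.
    return best, kept


# Toy stand-in for a training + eval run. The effect sizes are made up
# for illustration and are not results from the original experiments.
TOY_EFFECTS: Dict[str, float] = {
    "fix_temperature_clamp": -40.0,
    "bigger_projection": 5.0,
    "lr_warmup": -10.0,
}


def toy_experiment(changes: List[str]) -> float:
    return 100.0 + sum(TOY_EFFECTS[c] for c in changes)


best, kept = commit_or_revert(100.0, list(TOY_EFFECTS), toy_experiment)
print(best, kept)
```

Because the loop only ever accepts strict improvements over the current best, it behaves like greedy hill climbing. That fits the observation that it works well when the search space is well defined, and it offers no mechanism for discovering changes that nobody thought to propose.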