An agent-written RMSNorm kernel hit a 1.88x speedup on H100s. A fine-tuned Qwen3 0.6B hit 35% on LiveCodeBench. Neither result required a systems engineer, just coding agents with the right skills loaded.

Ben Burtenshaw from Hugging Face walks through three levels of agent autonomy: using Claude Code interactively to write and benchmark CUDA kernels distributed as versioned repos on the Hub, a zero-shot task where an agent fine-tunes a model end to end from a single prompt, and a multi-agent research lab running parallel experiments overnight on Hub compute while a reporter agent pushes results to a live Trackio dashboard. The through line is skills: file-based context that turns a zero-shot failure into a few-shot workflow. CUDA programming and ML training pipelines were deep specializations that took years to learn. Skills compress that timeline to hours.

Speaker info:
- https://x.com/ben_burtenshaw
- https://www.linkedin.com/in/ben-burtenshaw/
- https://github.com/burtenshaw

AI Engineer

A conference talk by Ben Burtenshaw from Hugging Face exploring how coding agents can tackle AI systems engineering tasks. Three progressively complex use cases are presented: (1) using agents to write and distribute optimized CUDA kernels via the Hugging Face kernels library, achieving a 94% speedup on Qwen3 8B on H100; (2) zero-shot LLM fine-tuning where an agent trains a model on the Hugging Face Hub; and (3) a multi-agent AutoLab setup inspired by Karpathy's Auto Research project, where specialized agents (researcher, planner, workers, reporter) autonomously run ML experiments, track results via Trackio, and iterate on training scripts. Key takeaways: agents work best with open primitives and well-exposed APIs rather than abstracted ones, and the Hugging Face Hub now provides the storage, compute, and tracking infrastructure needed for agentic ML workloads.

Your Coding Agent Should Do AI System Engineering — Ben Burtenshaw, Hugging Face