This post explores a research-driven approach to AI coding agents: adding a literature-search phase to the autoresearch loop before any code experiments run. Pointed at llama.cpp's CPU inference path with 4 cloud VMs and Claude Code, the agent first read arXiv papers and studied competing forks (ik_llama.cpp, llamafile) before writing any code. The research phase led to 5 successful optimizations out of 30+ experiments, primarily kernel fusions targeting flash attention's QK tile, RMS norm, and softmax, yielding +15% text-generation throughput on x86 and +5% on ARM for TinyLlama 1.1B, at a total cost of ~$29 over ~3 hours. The key insight: code-only agents generate shallow hypotheses when the optimization surface isn't visible in the source; for memory-bound inference workloads, domain knowledge from papers and competing implementations is essential.
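To make "kernel fusion on a memory-bound path" concrete, here is a minimal sketch in plain C, not llama.cpp's actual ggml kernels; the function names, signatures, and the eps parameter are illustrative. RMS norm followed by a per-channel weight multiply is naturally two passes over the activation vector; fusing them removes one full read and write of that vector, which is where the time goes when a kernel is bound by memory bandwidth rather than FLOPs.

```c
#include <math.h>
#include <stddef.h>

/* Unfused: normalize in one pass, apply the learned weight in a second.
 * The activation vector crosses the memory bus twice (read + write, twice). */
void rms_norm(const float *x, float *out, size_t n, float eps) {
    float ss = 0.0f;
    for (size_t i = 0; i < n; i++) ss += x[i] * x[i];
    const float scale = 1.0f / sqrtf(ss / (float)n + eps);
    for (size_t i = 0; i < n; i++) out[i] = x[i] * scale;
}

void apply_weight(const float *x, const float *w, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = x[i] * w[i];
}

/* Fused: one reduction pass, then a single combined normalize-and-scale
 * pass. Same arithmetic, one fewer trip over the vector; that saved
 * traffic is the entire win for a memory-bound kernel. */
void rms_norm_fused(const float *x, const float *w, float *out,
                    size_t n, float eps) {
    float ss = 0.0f;
    for (size_t i = 0; i < n; i++) ss += x[i] * x[i];
    const float scale = 1.0f / sqrtf(ss / (float)n + eps);
    for (size_t i = 0; i < n; i++) out[i] = x[i] * scale * w[i];
}
```

The point of the fused variant is invisible in a FLOP count, which is why a code-only agent tends to miss it: the instruction mix barely changes, only the memory traffic does.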
Table of contents
Where code-only context works
Where code-only context breaks down
Adding a research phase
The experiment log
What didn’t work
What this means for coding agents
Try it on your own project