This post explores a research-driven approach to AI coding agents: adding a literature-search phase to the autoresearch loop before any code experiments run. Pointed at llama.cpp's CPU inference path with 4 cloud VMs and Claude Code, the agent first read arXiv papers and studied competing forks (ik_llama.cpp, llamafile) before writing any code. The research phase led to 5 successful optimizations out of 30+ experiments, primarily kernel fusions targeting flash attention's QK tile, RMS norm, and softmax, yielding +15% text-generation throughput on x86 and +5% on ARM for TinyLlama 1.1B, at a total cost of ~$29 over ~3 hours. The key insight: code-only agents generate shallow hypotheses when the optimization surface isn't visible in the source; for memory-bound inference workloads, domain knowledge from papers and competing implementations is essential.
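To make "kernel fusion on a memory-bound path" concrete, here is a minimal sketch in plain C, not llama.cpp's actual ggml kernels; the function names, signatures, and the eps parameter are illustrative. RMS norm followed by a per-channel weight multiply is naturally two passes over the activation vector; fusing them removes one full read and write of that vector, which is where the time goes when a kernel is bound by memory bandwidth rather than FLOPs.

```c
#include <math.h>
#include <stddef.h>

/* Unfused: normalize in one pass, apply the learned weight in a second.
 * The activation vector crosses the memory bus twice (read + write, twice). */
void rms_norm(const float *x, float *out, size_t n, float eps) {
    float ss = 0.0f;
    for (size_t i = 0; i < n; i++) ss += x[i] * x[i];
    const float scale = 1.0f / sqrtf(ss / (float)n + eps);
    for (size_t i = 0; i < n; i++) out[i] = x[i] * scale;
}

void apply_weight(const float *x, const float *w, float *out, size_t n) {
    for (size_t i = 0; i < n; i++) out[i] = x[i] * w[i];
}

/* Fused: one reduction pass, then a single combined normalize-and-scale
 * pass. Same arithmetic, one fewer trip over the vector; that saved
 * traffic is the entire win for a memory-bound kernel. */
void rms_norm_fused(const float *x, const float *w, float *out,
                    size_t n, float eps) {
    float ss = 0.0f;
    for (size_t i = 0; i < n; i++) ss += x[i] * x[i];
    const float scale = 1.0f / sqrtf(ss / (float)n + eps);
    for (size_t i = 0; i < n; i++) out[i] = x[i] * scale * w[i];
}
```

The point of the fused variant is invisible in a FLOP count, which is why a code-only agent tends to miss it: the instruction mix barely changes, only the memory traffic does.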
Table of contents
Where code-only context works
Where code-only context breaks down
Adding a research phase
The experiment log
What didn’t work
What this means for coding agents
Try it on your own project