A full-text search query on turbopuffer was taking 220ms instead of the expected ~50ms. Profiling showed that over 60% of the runtime was spent in a merge iterator, not in BM25 ranking. The root cause: Rust's zero-cost iterator abstraction compiles each individual call efficiently, but the recursive nature of `next()` prevented the compiler from vectorizing or unrolling across calls. The fix was a classic database technique, batched iterators: `next_batch()` fills an array of 512 KV pairs at once, giving the compiler a tight inner loop it can auto-vectorize with SIMD. As a result, the benchmark dropped from 6.5ms to 110μs (roughly 60× faster), and production query latency fell from 220ms to 47ms. The key lesson: 'zero-cost' means the abstraction compiles away per call, not that it has no effect on the compiler's ability to optimize across calls.
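To make the batching idea concrete, here is a minimal sketch of what a batched iterator interface could look like. It is not turbopuffer's actual code: the `BatchIterator` trait, the `SliceSource` type, the `Kv` alias, and the `sum_values` consumer are hypothetical names invented for illustration, and the source is a plain sorted slice rather than a real merge over several inputs. Only the `next_batch()` name and the batch size of 512 come from the summary above.

```rust
/// Pairs per batch; 512 is the figure quoted in the summary above.
const BATCH_SIZE: usize = 512;

/// Hypothetical key-value pair type, used for illustration only.
type Kv = (u64, u64);

/// Hypothetical batched-iterator interface (an assumption, not a real API).
trait BatchIterator {
    /// Fills `out` from the front and returns how many slots were written.
    /// A return value of 0 means the source is exhausted.
    fn next_batch(&mut self, out: &mut [Kv; BATCH_SIZE]) -> usize;
}

/// Simplest possible source: a sorted in-memory slice.
struct SliceSource<'a> {
    data: &'a [Kv],
    pos: usize,
}

impl<'a> BatchIterator for SliceSource<'a> {
    fn next_batch(&mut self, out: &mut [Kv; BATCH_SIZE]) -> usize {
        let remaining = &self.data[self.pos..];
        let n = remaining.len().min(BATCH_SIZE);
        // One bulk copy per batch instead of one call per element.
        out[..n].copy_from_slice(&remaining[..n]);
        self.pos += n;
        n
    }
}

/// Consumer: one call into the iterator per batch, then a simple loop
/// over a contiguous slice, which is a shape the compiler can unroll
/// or vectorize more easily than a chain of per-element `next()` calls.
fn sum_values<I: BatchIterator>(iter: &mut I) -> u64 {
    let mut buf = [(0u64, 0u64); BATCH_SIZE];
    let mut total = 0u64;
    loop {
        let n = iter.next_batch(&mut buf);
        if n == 0 {
            break;
        }
        for &(_, value) in &buf[..n] {
            total = total.wrapping_add(value);
        }
    }
    total
}
```

A caller would build a `SliceSource { data: &pairs, pos: 0 }` over a vector of pairs and pass it to `sum_values`. The design point is that the per-call overhead is paid once per 512 elements, and the inner loop runs over a fixed-size buffer. Whether a given loop is actually vectorized still depends on the compiler and the loop body, so it is worth checking the generated assembly.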
Table of contents
Understanding the turbopuffer read path
Looking inside the merge iterator
Disassembling the abstraction
The cost hides beneath the abstraction
Breaking the abstraction to find a solution
Conclusion