Rust zero-cost abstractions vs. SIMD

A full-text search query on turbopuffer was taking 220ms instead of the expected ~50ms. Profiling showed that over 60% of the runtime was spent in a merge iterator, not in BM25 ranking. The root cause: Rust's zero-cost iterator abstraction compiles each individual call efficiently, but the recursive nature of `next()` prevented the compiler from vectorizing or unrolling across calls. The fix was a classic database technique, batched iterators: `next_batch()` fills an array of 512 KV pairs at once, giving the compiler a tight inner loop it can auto-vectorize with SIMD. The benchmark dropped from 6.5ms to 110μs (roughly 60× faster), and production query latency fell from 220ms to 47ms. The key lesson: 'zero-cost' means the abstraction compiles away per call, not that it has no effect on the compiler's ability to optimize across calls.
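To make the difference between the two interfaces concrete, here is a minimal sketch in Rust. It is not turbopuffer's code: the trait names `KvIter` and `BatchKvIter`, the `SliceSource` type, and the `(u64, u64)` pair layout are all hypothetical, chosen only for illustration. Only the `next_batch()` name and the batch size of 512 come from the summary above. The one-at-a-time consumer pays a call per element, while the batched consumer runs a plain loop over a slice, which is the shape an optimizer is more likely to unroll or vectorize.

```rust
/// Batch size; the summary mentions 512 KV pairs per batch.
const BATCH: usize = 512;

/// Hypothetical one-at-a-time interface: each call yields a single pair.
trait KvIter {
    fn next_kv(&mut self) -> Option<(u64, u64)>;
}

/// Hypothetical batched interface: each call fills `out` and returns
/// the number of slots written (0 means the source is exhausted).
trait BatchKvIter {
    fn next_batch(&mut self, out: &mut [(u64, u64); BATCH]) -> usize;
}

/// Toy source backed by a slice, standing in for a real storage layer.
struct SliceSource<'a> {
    data: &'a [(u64, u64)],
    pos: usize,
}

impl KvIter for SliceSource<'_> {
    fn next_kv(&mut self) -> Option<(u64, u64)> {
        let kv = self.data.get(self.pos).copied()?;
        self.pos += 1;
        Some(kv)
    }
}

impl BatchKvIter for SliceSource<'_> {
    fn next_batch(&mut self, out: &mut [(u64, u64); BATCH]) -> usize {
        let remaining = &self.data[self.pos..];
        let n = remaining.len().min(BATCH);
        // Copy up to BATCH pairs in one go instead of one call per pair.
        out[..n].copy_from_slice(&remaining[..n]);
        self.pos += n;
        n
    }
}

/// Consumer that makes one dynamic call per element.
fn sum_values_one_by_one(it: &mut dyn KvIter) -> u64 {
    let mut total = 0u64;
    while let Some((_, v)) = it.next_kv() {
        total = total.wrapping_add(v);
    }
    total
}

/// Consumer that makes one dynamic call per batch, then runs a tight
/// inner loop over a contiguous slice.
fn sum_values_batched(it: &mut dyn BatchKvIter) -> u64 {
    let mut buf = [(0u64, 0u64); BATCH];
    let mut total = 0u64;
    loop {
        let n = it.next_batch(&mut buf);
        if n == 0 {
            break;
        }
        for &(_, v) in &buf[..n] {
            total = total.wrapping_add(v);
        }
    }
    total
}
```

Both consumers compute the same sum. Whether the inner loop is actually vectorized depends on the compiler version, optimization level, and target, so treat this as an illustration of the interface shape, not as a reproduction of the reported speedup.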

#compiler-optimization

#performance

#rust

Mar 04•12m read time•From turbopuffer.com

Table of contents

Understanding the turbopuffer read path
Looking inside the merge iterator
Disassembling the abstraction
The cost hides beneath the abstraction
Breaking the abstraction to find a solution
Conclusion
turbopuffer