A deep dive into an unexpected 4x performance regression caused by mismatched `bl`/`ret` instruction pairs in AArch64 assembly. The author attempted to eliminate a branch in an inner loop by reusing a single `bl` to set up the link register, but this confused the CPU's return branch predictor, which maintains an internal stack of expected return addresses. Performance counters confirmed a 93% branch misprediction rate. The fix is to use `br x30` instead of `ret` when the call/return pairing is asymmetric. The post also explores further optimizations: inlining, Rust's iterator sum, and finally hand-written SIMD with loop unrolling, which achieves an 8.8x speedup over the baseline.
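To make the mechanism concrete, here is a toy Rust model of a return-address stack. It is only a sketch under stated assumptions, not the author's code and not a description of any real CPU: the `ReturnStack` type, its 16-entry capacity, the address `0x1000`, and both helper functions are hypothetical. It counts how often a predicted return target would be wrong when every `ret` has a matching `bl`, and when a single `bl` is followed by many `ret`s.

```rust
/// Toy return-address stack: `bl` pushes a predicted return target and
/// `ret` pops one. The capacity is an illustrative assumption.
struct ReturnStack {
    entries: Vec<u64>,
    capacity: usize,
}

impl ReturnStack {
    fn new(capacity: usize) -> Self {
        ReturnStack { entries: Vec::new(), capacity }
    }

    /// Models `bl`: remember where the matching `ret` should land.
    fn call(&mut self, return_addr: u64) {
        if self.entries.len() == self.capacity {
            self.entries.remove(0); // drop the oldest entry when full
        }
        self.entries.push(return_addr);
    }

    /// Models `ret`: pop the prediction and report whether it was right.
    fn ret(&mut self, actual_target: u64) -> bool {
        self.entries.pop() == Some(actual_target)
    }
}

/// Every iteration performs one `bl` and one `ret` (balanced pairing).
fn paired_mispredictions(iterations: usize) -> usize {
    let mut ras = ReturnStack::new(16);
    let loop_head = 0x1000; // hypothetical address
    (0..iterations)
        .filter(|_| {
            ras.call(loop_head);
            !ras.ret(loop_head)
        })
        .count()
}

/// One `bl` up front, then a `ret` per iteration (asymmetric pairing).
fn mismatched_mispredictions(iterations: usize) -> usize {
    let mut ras = ReturnStack::new(16);
    let loop_head = 0x1000; // hypothetical address
    ras.call(loop_head);
    (0..iterations).filter(|_| !ras.ret(loop_head)).count()
}

fn main() {
    println!("paired:     {}", paired_mispredictions(1000));
    println!("mismatched: {}", mismatched_mispredictions(1000));
}
```

In this model the balanced version never mispredicts, while the asymmetric version runs out of predictions after the first `ret`. Real hardware is more complicated, so the numbers here are not comparable to the measured 93% rate. Replacing `ret` with `br x30` corresponds to not calling `ret` on the model at all, which leaves the stack untouched.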

10m read time · From mattkeeter.com
Table of contents
Do Not Taunt Happy Fun Branch Predictor
Appendix: Going Fast
