Gunnar Morling

Java 16's incubating Vector API (JEP 338, Project Panama) lets developers write explicitly vectorized SIMD code without resorting to low-level CPU intrinsics. Using FizzBuzz as a worked example, the post walks through a SIMD-based solution built on IntVector and blend() operations: masks for the divisibility patterns are pre-computed, and the array is processed in vector-sized chunks, one element per lane. Benchmarks on a 2.6 GHz Intel i7 show roughly 4x the throughput of the scalar implementation. The post also examines the generated native assembly (the AVX2 vpblendvb instruction), explores masked handling of the tail elements (which, surprisingly, hurts performance), and tests on Apple M1 (AArch64), where 256-bit vector operations fall back to scalar execution because NEON registers are only 128 bits wide. Key takeaways: SIMD isn't always faster, benchmarking is essential, and the Vector API makes SIMD accessible to mainstream Java developers.
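To make the mask-and-blend idea concrete, here is a minimal sketch in plain Java that *emulates* what the Vector API does lane by lane, so it runs without the incubator module. It is not the post's code. Assumptions: 8 lanes of 32-bit ints (mirroring a 256-bit vector), a hypothetical result encoding of -1 = Fizz, -2 = Buzz, -3 = FizzBuzz, and a hypothetical helper `blend()` that mimics `IntVector.blend(int, VectorMask)`. Because lcm(8, 15) = 120, the divisibility pattern repeats every 15 chunks, so 15 sets of masks are enough.

```java
import java.util.Arrays;

class FizzBuzzLanes {
    // Hypothetical encoding (an assumption, not taken from the post).
    static final int FIZZ = -1, BUZZ = -2, FIZZ_BUZZ = -3;
    // 8 lanes of 32-bit ints, mirroring a 256-bit vector.
    static final int LANES = 8;
    // lcm(8, 15) = 120, so the pattern repeats every 120 / 8 = 15 chunks.
    static final int PATTERNS = 15;

    static final boolean[][] FIZZ_MASKS = new boolean[PATTERNS][LANES];
    static final boolean[][] BUZZ_MASKS = new boolean[PATTERNS][LANES];
    static final boolean[][] FIZZ_BUZZ_MASKS = new boolean[PATTERNS][LANES];

    // Pre-compute the (mutually exclusive) masks once for all 15 chunk patterns.
    static {
        for (int p = 0; p < PATTERNS; p++) {
            for (int lane = 0; lane < LANES; lane++) {
                int value = p * LANES + lane + 1; // FizzBuzz numbers start at 1
                FIZZ_MASKS[p][lane] = value % 3 == 0 && value % 5 != 0;
                BUZZ_MASKS[p][lane] = value % 5 == 0 && value % 3 != 0;
                FIZZ_BUZZ_MASKS[p][lane] = value % 15 == 0;
            }
        }
    }

    // Emulates IntVector.blend(int e, VectorMask m): lanes with a set mask bit take e.
    static int[] blend(int[] v, int e, boolean[] mask) {
        int[] out = v.clone();
        for (int lane = 0; lane < LANES; lane++) {
            if (mask[lane]) out[lane] = e;
        }
        return out;
    }

    static int[] fizzBuzz(int n) {
        int[] result = new int[n];
        int upperBound = n - n % LANES; // analogous to VectorSpecies.loopBound(n)
        int i = 0;
        for (; i < upperBound; i += LANES) {
            int p = (i / LANES) % PATTERNS; // which pre-computed mask set applies
            int[] chunk = new int[LANES];
            for (int lane = 0; lane < LANES; lane++) chunk[lane] = i + lane + 1;
            chunk = blend(chunk, FIZZ, FIZZ_MASKS[p]);
            chunk = blend(chunk, BUZZ, BUZZ_MASKS[p]);
            chunk = blend(chunk, FIZZ_BUZZ, FIZZ_BUZZ_MASKS[p]);
            System.arraycopy(chunk, 0, result, i, LANES);
        }
        // Plain scalar loop for the remaining n % LANES tail elements.
        for (; i < n; i++) {
            int value = i + 1;
            if (value % 15 == 0) result[i] = FIZZ_BUZZ;
            else if (value % 5 == 0) result[i] = BUZZ;
            else if (value % 3 == 0) result[i] = FIZZ;
            else result[i] = value;
        }
        return result;
    }

    public static void main(String[] args) {
        // prints [1, 2, -1, 4, -2, -1, 7, 8, -1, -2, 11, -1, 13, 14, -3, 16, 17, -1, 19, -2]
        System.out.println(Arrays.toString(fizzBuzz(20)));
    }
}
```

With the real Vector API, each inner lane loop would collapse into a single vector operation (for example `IntVector.fromArray`, `blend` and `intoArray`), and the scalar tail loop is the unmasked alternative to the masked tail handling discussed in the post.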

FizzBuzz – SIMD Style!