On Apple Silicon, a WebAssembly module's linear memory can be shared directly with the GPU using zero-copy techniques, eliminating the serialization boundary that normally sits between VM sandboxes and hardware accelerators. The approach chains three components: mmap for a page-aligned allocation, Metal's makeBuffer(bytesNoCopy:) to wrap that pointer as a GPU buffer without copying, and Wasmtime's MemoryCreator trait to expose the same mmap region as Wasm linear memory. Benchmarks on an M1 MacBook Pro show RSS overhead of only 0.03 MB versus 16.78 MB for the copy path, with identical compute latency.

Running Llama 3.2 1B Instruct through this pipeline via Apple's MLX framework yields roughly 9 ms per generated token, with negligible Wasm-to-GPU dispatch overhead. A key capability this enables is portable KV cache snapshots: serializing and restoring conversation context is 5.45× faster than re-running prefill at 24 tokens, and the advantage grows linearly with context length. This forms the foundation of "Driftwood", a runtime for stateful Wasm actors with GPU inference that lets frozen conversation state be moved across machines.
Table of contents
- Why this is normally hard
- The three-link chain
- What I measured
- From zero-copy to inference
- KV cache portability
- What's being built