On Apple Silicon, a WebAssembly module's linear memory can be shared directly with the GPU using zero-copy techniques, eliminating the serialization boundary that normally sits between VM sandboxes and hardware accelerators. The approach chains three components: mmap for a page-aligned allocation, Metal's makeBuffer(bytesNoCopy:) to wrap that pointer as a GPU buffer without copying, and Wasmtime's MemoryCreator trait to expose the same mmap region as Wasm linear memory. Benchmarks on an M1 MacBook Pro show RSS overhead of only 0.03 MB versus 16.78 MB for the copy path, with identical compute latency.

Running Llama 3.2 1B Instruct through this pipeline via Apple's MLX framework yields roughly 9 ms per generated token, with negligible Wasm-to-GPU dispatch overhead. A key capability this enables is portable KV cache snapshots: serializing and restoring conversation context is 5.45× faster than re-running prefill at 24 tokens, and the advantage grows linearly with context length. This forms the foundation of "Driftwood", a runtime for stateful Wasm actors with GPU inference that lets frozen conversation state be moved across machines.
Table of contents
- Why this is normally hard
- The three-link chain
- What I measured
- From zero-copy to inference
- KV cache portability
- What's being built