Flash-MoE is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. The 209GB model streams from SSD using parallel pread() calls, loading only the 4 experts active in each layer at any given time. Key optimizations include FMA-optimized dequantization kernels (12% speedup), Apple Accelerate BLAS for linear attention (64% faster), deferred GPU command execution to overlap CPU and GPU work, and trusting the OS page cache instead of custom caching (the page cache outperformed every custom approach tested). The project documents 58 experiments, detailing what worked and what didn't, including failed attempts at LZ4 compression, prefetching, speculative decoding, and custom Metal LRU caches.
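As a rough illustration of the expert-streaming idea, here is a minimal C sketch of fetching the active experts' weights with parallel pread() calls, one thread per expert. The names (expert_read_task, read_expert, load_active_experts, NUM_ACTIVE_EXPERTS) are illustrative assumptions, not identifiers from the Flash-MoE source.

```c
/* Sketch: stream the weights of the routed experts for one MoE layer
 * from the model file in parallel. Names are hypothetical. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_ACTIVE_EXPERTS 4 /* active experts per layer, per the README */

typedef struct {
    int    fd;     /* shared read-only fd for the model file */
    off_t  offset; /* byte offset of this expert's weights */
    size_t size;   /* bytes to read */
    void  *dst;    /* destination buffer */
} expert_read_task;

static void *read_expert(void *arg) {
    expert_read_task *t = (expert_read_task *)arg;
    size_t done = 0;
    while (done < t->size) {
        /* pread() is thread-safe on a shared fd: it takes an explicit
         * offset, so there is no seek position for threads to race on. */
        ssize_t n = pread(t->fd, (char *)t->dst + done,
                          t->size - done, t->offset + (off_t)done);
        if (n <= 0) { perror("pread"); exit(1); }
        done += (size_t)n;
    }
    return NULL;
}

/* Issue all NUM_ACTIVE_EXPERTS reads concurrently, then wait for them. */
static void load_active_experts(expert_read_task tasks[NUM_ACTIVE_EXPERTS]) {
    pthread_t threads[NUM_ACTIVE_EXPERTS];
    for (int i = 0; i < NUM_ACTIVE_EXPERTS; i++)
        pthread_create(&threads[i], NULL, read_expert, &tasks[i]);
    for (int i = 0; i < NUM_ACTIVE_EXPERTS; i++)
        pthread_join(threads[i], NULL);
}
```

A side effect of this design is that repeated reads of frequently routed experts are served from the OS page cache for free, which is consistent with the README's finding that trusting the page cache beat every custom caching scheme tested.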
Table of contents
- Results
- Hardware
- Architecture
- Quick Start
- Project Structure
- What We Tried (and What Worked)
- Safety