Flash-MoE is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. The 209GB model streams from SSD using parallel pread() calls, with only the 4 active experts per layer resident in memory at a time. The key optimizations are covered in What We Tried (and What Worked) below.
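
As a rough illustration of the streaming pattern described above, here is a minimal C sketch: one pthread per active expert, each issuing pread() at that expert's byte offset. Because pread() takes an explicit offset, concurrent readers never contend on a shared file position. The file name, struct, sizes, and thread count here are illustrative assumptions, not taken from the Flash-MoE source.

```c
#include <fcntl.h>
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

#define N_READ_THREADS 4 /* e.g. one thread per active expert (assumed) */

typedef struct {
    int fd;       /* shared descriptor for the weights file */
    off_t offset; /* byte offset of this expert's weights */
    size_t size;  /* bytes to read */
    void *dst;    /* destination buffer for the expert's weights */
} expert_read_job; /* hypothetical name, not from the codebase */

/* Stream one expert's weights; loop until the full region is read. */
static void *read_expert(void *arg) {
    expert_read_job *job = (expert_read_job *)arg;
    size_t done = 0;
    while (done < job->size) {
        ssize_t n = pread(job->fd, (char *)job->dst + done,
                          job->size - done, job->offset + (off_t)done);
        if (n <= 0) { perror("pread"); exit(1); }
        done += (size_t)n;
    }
    return NULL;
}

int main(void) {
    int fd = open("weights.bin", O_RDONLY); /* placeholder file name */
    if (fd < 0) { perror("open"); return 1; }

    pthread_t threads[N_READ_THREADS];
    expert_read_job jobs[N_READ_THREADS];
    size_t expert_bytes = 16u << 20; /* placeholder: 16 MiB per expert */

    /* Kick off all reads in parallel, then wait for them to finish. */
    for (int i = 0; i < N_READ_THREADS; i++) {
        jobs[i] = (expert_read_job){ fd, (off_t)i * (off_t)expert_bytes,
                                     expert_bytes, malloc(expert_bytes) };
        pthread_create(&threads[i], NULL, read_expert, &jobs[i]);
    }
    for (int i = 0; i < N_READ_THREADS; i++)
        pthread_join(threads[i], NULL);

    close(fd);
    for (int i = 0; i < N_READ_THREADS; i++) free(jobs[i].dst);
    return 0;
}
```

Compile with `cc -O2 demo.c -lpthread`. The real engine presumably computes each offset from the router's expert selection per layer; the fixed offsets above only stand in for that lookup.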
Table of contents
- Results
- Hardware
- Architecture
- Quick Start
- Project Structure
- What We Tried (and What Worked)
- Safety