Flash-MoE is a pure C/Metal inference engine that runs Qwen3.5-397B-A17B, a 397-billion-parameter Mixture-of-Experts model, on a MacBook Pro with 48GB RAM at 4.4+ tokens/second. The 209GB model streams from SSD using parallel pread() calls, loading only the 4 experts active in each layer at any given time. Key optimizations include FMA-optimized dequantization kernels (12% speedup), Apple Accelerate BLAS for linear attention (64% faster), deferred GPU command execution to overlap CPU and GPU work, and trusting the OS page cache instead of custom caching (the page cache outperformed every custom approach tested). The project documents 58 experiments, detailing what worked and what didn't, including failed attempts at LZ4 compression, prefetching, speculative decoding, and custom Metal LRU caches.
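As a rough illustration of the expert-streaming idea, here is a minimal C sketch of fetching the active experts' weights with parallel pread() calls, one thread per expert. The names (expert_read_task, read_expert, load_active_experts, NUM_ACTIVE_EXPERTS) are illustrative assumptions, not identifiers from the Flash-MoE source.

```c
/* Sketch: stream the weights of the routed experts for one MoE layer
 * from the model file in parallel. Names are hypothetical. */
#include <pthread.h>
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>
#include <unistd.h>

#define NUM_ACTIVE_EXPERTS 4 /* active experts per layer, per the README */

typedef struct {
    int    fd;     /* shared read-only fd for the model file */
    off_t  offset; /* byte offset of this expert's weights */
    size_t size;   /* bytes to read */
    void  *dst;    /* destination buffer */
} expert_read_task;

static void *read_expert(void *arg) {
    expert_read_task *t = (expert_read_task *)arg;
    size_t done = 0;
    while (done < t->size) {
        /* pread() is thread-safe on a shared fd: it takes an explicit
         * offset, so there is no seek position for threads to race on. */
        ssize_t n = pread(t->fd, (char *)t->dst + done,
                          t->size - done, t->offset + (off_t)done);
        if (n <= 0) { perror("pread"); exit(1); }
        done += (size_t)n;
    }
    return NULL;
}

/* Issue all NUM_ACTIVE_EXPERTS reads concurrently, then wait for them. */
static void load_active_experts(expert_read_task tasks[NUM_ACTIVE_EXPERTS]) {
    pthread_t threads[NUM_ACTIVE_EXPERTS];
    for (int i = 0; i < NUM_ACTIVE_EXPERTS; i++)
        pthread_create(&threads[i], NULL, read_expert, &tasks[i]);
    for (int i = 0; i < NUM_ACTIVE_EXPERTS; i++)
        pthread_join(threads[i], NULL);
}
```

A side effect of this design is that repeated reads of frequently routed experts are served from the OS page cache for free, which is consistent with the README's finding that trusting the page cache beat every custom caching scheme tested.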
Table of contents
- Results
- Hardware
- Architecture
- Quick Start
- Project Structure
- What We Tried (and What Worked)
- Safety