A research paper presents a method for running Mixture-of-Experts (MoE) language models efficiently on hardware with limited GPU memory. The approach combines two techniques: an LRU cache that keeps recently activated expert weights on the GPU across token generation steps, and speculative expert loading, which uses the outputs of earlier layers to predict which experts later layers will need and starts loading them ahead of time. Applied to Mixtral-8x7B with mixed quantization (4-bit attention, 2-3 bit experts), the method reaches 2-3 tokens/sec on low-tier GPUs and 2 tokens/sec on free-tier Google Colab, compared with 0.6 tokens/sec for naive offloading. The LRU cache alone achieves hit rates of roughly 40-60%, and speculative loading raises the share of correctly prefetched experts to roughly 80-90%.
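To make the two techniques concrete, here is a minimal sketch of an LRU expert cache with a prefetch hook. It is an illustration under stated assumptions, not the paper's implementation: the class name `ExpertLRUCache`, the `load_fn` callback (standing in for a host-to-GPU weight copy), and the single shared cache are all hypothetical simplifications, and the prediction of which expert to prefetch is left out.

```python
from collections import OrderedDict
from typing import Callable, Hashable


class ExpertLRUCache:
    """Keeps at most `capacity` experts resident and evicts the least recently used."""

    def __init__(self, capacity: int, load_fn: Callable[[Hashable], object]):
        if capacity < 1:
            raise ValueError("capacity must be at least 1")
        self.capacity = capacity
        self.load_fn = load_fn          # hypothetical stand-in for a host -> GPU copy
        self._resident = OrderedDict()  # expert id -> weights, least recent first
        self.hits = 0
        self.misses = 0

    def _insert(self, expert_id, weights):
        self._resident[expert_id] = weights
        self._resident.move_to_end(expert_id)
        while len(self._resident) > self.capacity:
            self._resident.popitem(last=False)  # drop the least recently used expert

    def get(self, expert_id):
        """Return an expert's weights, loading them on a cache miss."""
        if expert_id in self._resident:
            self.hits += 1
            self._resident.move_to_end(expert_id)
            return self._resident[expert_id]
        self.misses += 1
        weights = self.load_fn(expert_id)
        self._insert(expert_id, weights)
        return weights

    def prefetch(self, expert_id):
        """Speculatively load an expert; does not count as a hit or a miss."""
        if expert_id not in self._resident:
            self._insert(expert_id, self.load_fn(expert_id))


loads = []  # records every simulated transfer


def fake_load(expert_id):
    loads.append(expert_id)
    return f"weights-{expert_id}"


cache = ExpertLRUCache(capacity=2, load_fn=fake_load)
for expert_id in [0, 1, 0, 2, 1]:
    cache.get(expert_id)
hits_before_prefetch = cache.hits

cache.prefetch(0)  # assume a predictor guessed expert 0 is needed next
cache.get(0)       # the guess was right, so this access is a hit
```

In this toy run only one of the first five accesses hits the cache, while the prefetched expert is served without a blocking load. A real system would overlap the prefetch with computation on the current layer; that overlap is not modelled here.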

