In this video we review a recent important paper titled "Fast Inference of Mixture-of-Experts Language Models with Offloading".
Mixture of Experts (MoE) is now an important strategy for improving the efficiency of transformer-based large language models (LLMs).
However, MoE models usually have a large memory footprint, since the weights of all experts need to be loaded. This makes it hard to run MoE models on low-tier GPUs.
This paper introduces a method for running transformer-based MoE LLMs efficiently in a limited-memory environment using offloading techniques. Specifically, the researchers are able to run Mixtral-8x7B on the free-tier version of Google Colab.
In the video, we provide a reminder of how mixture of experts works, and then dive into the offloading method presented in this paper.

-----------------------------------------------------------------------------------------------
Paper page - https://arxiv.org/abs/2312.17238
Soft MoE - https://youtu.be/rae0Eal8ZHA
Code - https://github.com/dvmazur/mixtral-offloading
Post - https://aipapersacademy.com/moe-offloading/
Original Mixture-of-Experts paper review - https://aipapersacademy.com/mixture-of-experts/
-----------------------------------------------------------------------------------------------
✉️ Join the newsletter - https://aipapersacademy.com/newsletter/

👍 Please like & subscribe if you enjoy this content

We use VideoScribe to edit our videos - https://tidd.ly/44TZEiX (affiliate)
-----------------------------------------------------------------------------------------------
Chapters:
0:00 Paper Introduction
1:34 Mixture of Experts
3:44 MoE Offloading
10:29 Mixed MoE Quantization
11:13 Inference Speed

AI Papers Academy

The paper presents a method for efficiently running Mixture-of-Experts (MoE) language models on hardware with limited GPU memory. The approach combines two key techniques: an LRU cache that retains recently activated expert weights across token generation steps, and speculative expert loading, which predicts the experts that future layers will need from the outputs of earlier layers. Applied to Mixtral-8x7B with mixed quantization (4-bit attention, 2-3 bit experts), the method achieves 2-3 tokens per second on low-tier GPUs and 2 tokens per second on free-tier Google Colab, compared to 0.6 tokens per second with naive offloading. Cache hit rates reach ~40-60% with LRU alone, and speculative loading pushes correct prefetch rates above 80-90%.
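
To make the caching idea concrete, here is a minimal Python sketch of an LRU cache for the experts of a single MoE layer, with an optional prefetch hook standing in for speculative loading. It is an illustration under stated assumptions, not the code from the paper's repository: the class ExpertLRUCache, the load_fn callback (a placeholder for copying expert weights from host memory to the GPU), the run_trace helper, and the toy routing trace are all hypothetical names and data introduced here.

```python
from collections import OrderedDict


class ExpertLRUCache:
    """Keeps at most `capacity` experts resident for one MoE layer (hypothetical helper)."""

    def __init__(self, capacity, load_fn):
        self.capacity = capacity
        self.load_fn = load_fn         # assumption: stands in for a host-to-GPU weight copy
        self.resident = OrderedDict()  # expert_id -> weights, least recently used first
        self.hits = 0
        self.misses = 0

    def _insert(self, expert_id):
        # Evict the least recently used expert when the cache is full.
        if len(self.resident) >= self.capacity:
            self.resident.popitem(last=False)
        self.resident[expert_id] = self.load_fn(expert_id)

    def get(self, expert_id):
        """Return the expert's weights, loading them on a miss."""
        if expert_id in self.resident:
            self.hits += 1
            self.resident.move_to_end(expert_id)  # mark as most recently used
        else:
            self.misses += 1
            self._insert(expert_id)
        return self.resident[expert_id]

    def prefetch(self, expert_id):
        """Speculatively load a predicted expert without counting a hit or a miss."""
        if expert_id not in self.resident:
            self._insert(expert_id)


def run_trace(cache, trace):
    """Replay per-token expert choices and return the resulting cache hit rate."""
    for experts in trace:
        for expert_id in experts:
            cache.get(expert_id)
    return cache.hits / (cache.hits + cache.misses)


# Toy routing trace: each tuple holds the two experts chosen for one token.
loads = []
cache = ExpertLRUCache(
    capacity=2,
    load_fn=lambda e: loads.append(e) or f"weights[{e}]",  # records each simulated load
)
trace = [(0, 1), (0, 2), (2, 3), (0, 3)]
hit_rate = run_trace(cache, trace)
print(f"hit rate: {hit_rate:.3f}, loads: {loads}")
```

In this sketch a prefetch can evict an expert that is still needed, so a wrong guess costs an extra load later. A real implementation would also need to overlap the weight copy with computation for the prefetch to hide any latency, which this toy version does not model.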