In this video, we dive into a recent research paper by Google titled "Mixture of Nested Experts: Adaptive Processing of Visual Tokens". Standard Mixture of Experts (MoE) is successfully applied in LLMs, and also in computer vision, to increase model size without a proportional increase in computational cost, but it comes with a large memory footprint. Mixture of Nested Experts (MoNE), which we review in this video, tackles that drawback. MoNE is built on top of the Vision Transformer (ViT) architecture and offers a substantial efficiency improvement by leveraging the fact that images naturally contain a large amount of redundant information. So, while ViT (with or without MoE) spends the same amount of compute on every token, MoNE learns to allocate compute to tokens based on their importance.
Watch the video to learn more.

Paper page - https://arxiv.org/abs/2407.19985
Mixture of Experts (MoE) Video - https://youtu.be/kb6eH0zCnl8
Post - https://aipapersacademy.com/mixture-of-nested-experts/
Original Mixture-of-Experts paper review - https://aipapersacademy.com/mixture-of-experts/

-----------------------------------------------------------------------------------------------
✉️ Join the newsletter - https://aipapersacademy.com/newsletter/

👍 Please like & subscribe if you enjoy this content
-----------------------------------------------------------------------------------------------

Chapters:
0:00 Introduction
1:20 MoNE Illustration
4:36 MoNE Diagram
5:47 Results

AI Papers Academy

Google's Mixture of Nested Experts (MoNE) addresses two key inefficiencies in vision transformers: the large memory footprint of standard Mixture-of-Experts (MoE) and the redundant compute spent on uninformative image patches. MoNE introduces nested experts of varying sizes within each transformer layer, using the full layer weights, half of them, or a quarter of them. A router using Expert Preferred Routing (EPR) assigns tokens to experts based on importance: high-information tokens go to the full-capacity expert, while background or redundant tokens are handled by smaller, cheaper experts. Tokens routed to smaller experts use fewer weights in both the attention and MLP modules, which reduces compute. Results on ImageNet-21k show that MoNE achieves accuracy comparable to the baselines at significantly lower computational cost.
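
For readers who want to see the idea in code, here is a minimal NumPy sketch of the routing and nested-expert steps described above. It is an illustration under stated assumptions, not the paper's implementation: the function names (expert_preferred_routing, nested_mlp), the capacity fractions, the ReLU activation, and the zero-padding of the output are all hypothetical choices made for this example. It covers only the MLP module (the attention module is left out) and assumes that routing is greedy, with the largest expert picking its tokens first.

```python
import numpy as np

def expert_preferred_routing(router_probs, capacities):
    """Greedy token-to-expert assignment, largest expert first (assumed scheme).

    router_probs: (num_tokens, num_experts); column 0 is the smallest expert.
    capacities:   fraction of tokens each expert may take, in the same order.
    Returns one expert index per token.
    """
    num_tokens, num_experts = router_probs.shape
    assignment = np.full(num_tokens, -1)
    for e in range(num_experts - 1, -1, -1):       # largest expert chooses first
        k = int(capacities[e] * num_tokens)
        free = np.flatnonzero(assignment == -1)    # tokens not yet assigned
        # pick the k free tokens with the highest router score for expert e
        chosen = free[np.argsort(-router_probs[free, e])[:k]]
        assignment[chosen] = e
    assignment[assignment == -1] = 0               # leftovers go to the smallest expert
    return assignment

def nested_mlp(x, w_in, w_out, assignment, dims):
    """MLP in which expert e reads and writes only the first dims[e] features.

    All experts share one pair of weight matrices; smaller experts use a slice.
    """
    out = np.zeros_like(x)                         # unused features stay zero (assumption)
    for e, d in enumerate(dims):
        idx = np.flatnonzero(assignment == e)
        hidden = np.maximum(x[idx, :d] @ w_in[:d, :], 0.0)  # sliced input projection + ReLU
        out[idx, :d] = hidden @ w_out[:, :d]                # sliced output projection
    return out

rng = np.random.default_rng(0)
num_tokens, D, H = 8, 16, 32
dims = [D // 4, D // 2, D]                         # quarter, half, full expert
capacities = [0.5, 0.25, 0.25]                     # hypothetical compute budget

x = rng.normal(size=(num_tokens, D))
logits = x @ rng.normal(size=(D, len(dims)))       # toy linear router
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)

assignment = expert_preferred_routing(probs, capacities)
y = nested_mlp(x, rng.normal(size=(D, H)), rng.normal(size=(H, D)), assignment, dims)
tokens_per_expert = np.bincount(assignment, minlength=len(dims))
print(tokens_per_expert)
```

With these capacities, half of the tokens pass through the quarter-size slice of the weights, so most of the matrix-multiply cost is spent on the two tokens the router scores highest for the full expert.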