Google's Mixture of Nested Experts (MoNE) addresses two key inefficiencies in vision transformers: the large memory footprint of standard Mixture-of-Experts (MoE) and the redundant compute spent on uninformative image patches. MoNE introduces nested experts of varying sizes (full, half, and quarter layer weights) within each transformer layer. A router using Expert Preferred Routing (EPR) assigns tokens to experts based on importance: high-information tokens go to the full-capacity expert, while background or redundant tokens are handled by smaller, cheaper experts. Tokens routed to smaller experts use fewer weights in both the attention and MLP modules, which reduces compute. Results on ImageNet-21k show that MoNE achieves accuracy comparable to baselines at significantly lower computational cost.
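
The routing idea can be illustrated with a minimal sketch in plain Python. This is not the authors' implementation: the function names (`assign_tokens`, `nested_project`), the per-expert capacities, the fallback to the smallest expert, and the toy numbers are all assumptions chosen for illustration. It assumes experts are ordered from largest to smallest, that each expert in turn picks the unassigned tokens that score highest for it, and that a smaller expert simply uses the first `d` input features together with the matching rows of a shared weight matrix.

```python
def assign_tokens(scores, capacities):
    """Greedy, simplified expert-preferred assignment.

    scores[t][e] is the router score of token t for expert e, with experts
    ordered from largest (e = 0) to smallest. capacities[e] is the number
    of tokens expert e may take. Returns a list mapping token -> expert.
    """
    num_tokens = len(scores)
    assignment = [None] * num_tokens
    for e, cap in enumerate(capacities):
        free = [t for t in range(num_tokens) if assignment[t] is None]
        # Each expert, largest first, picks the tokens that score highest for it.
        free.sort(key=lambda t: scores[t][e], reverse=True)
        for t in free[:cap]:
            assignment[t] = e
    # Leftover tokens fall back to the smallest expert (an assumption of this sketch).
    last = len(capacities) - 1
    return [last if a is None else a for a in assignment]


def nested_project(x, W, d):
    """Multiply the first d features of x by the first d rows of the shared W."""
    out_dim = len(W[0])
    return [sum(x[i] * W[i][j] for i in range(d)) for j in range(out_dim)]


# Toy setup: 4 tokens, 3 nested experts that read 4, 2, and 1 input features.
expert_dims = [4, 2, 1]
capacities = [1, 1, 2]
scores = [
    [0.9, 0.05, 0.05],
    [0.2, 0.5, 0.3],
    [0.1, 0.3, 0.6],
    [0.3, 0.3, 0.4],
]
assignment = assign_tokens(scores, capacities)

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [2.0, 0.0]]  # shared 4x2 weight matrix
tokens = [[1.0, 2.0, 3.0, 4.0]] * 4
outputs = [nested_project(x, W, expert_dims[e]) for x, e in zip(tokens, assignment)]

# Multiply-adds actually used versus sending every token through the full expert.
used = sum(expert_dims[e] * len(W[0]) for e in assignment)
full = len(tokens) * expert_dims[0] * len(W[0])
print(assignment, used, full)  # [0, 1, 2, 2] 16 32
```

In this toy case only one token pays for the full projection, so the layer uses half the multiply-adds of the dense version while all experts share a single weight matrix, which is the source of the memory saving relative to a standard MoE with separate expert weights.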

