Mixture-of-Experts (MoE) is a technique that increases model capacity without a proportional increase in compute by routing each input token to only a subset of specialized sub-networks called experts. The idea dates back to early-1990s work on adaptive mixtures of local experts; the sparsely gated form used in large models today was introduced in a 2017 Google paper, 'Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer', which appeared a few months before the Transformer paper and counts Geoffrey Hinton among its co-authors. A gating network (router) scores the experts for each token and selects the top few. Only the chosen experts run for a given input, so compute per token depends on the number of selected experts rather than the total number, and the experts can execute in parallel. The outputs of the selected experts are combined via a weighted sum, with weights produced by the gating network, and the router and experts are trained jointly.
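
The routing-and-mixing step described above can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the paper's implementation: each expert is a single hypothetical linear map rather than a feed-forward network, the router is one weight matrix, and the noise term and load-balancing loss from the 2017 paper are left out. The names `moe_forward`, `router_weights`, and `experts` are invented for this example.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


def moe_forward(token, router_weights, experts, k=2):
    """Route one token to its top-k experts and mix their outputs.

    Assumption: experts are plain linear maps (hypothetical stand-ins
    for the feed-forward experts used in real models).
    """
    # The router produces one score (logit) per expert.
    logits = matvec(router_weights, token)
    # Keep the indices of the k highest-scoring experts.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Normalise the gate weights over the selected experts only.
    gates = softmax([logits[i] for i in top])
    output = [0.0] * len(token)
    for gate, idx in zip(gates, top):
        # Only the selected experts are evaluated; the rest are skipped.
        expert_out = matvec(experts[idx], token)
        output = [o + gate * e for o, e in zip(output, expert_out)]
    return output, top, gates


def random_matrix(rng, rows, cols):
    """Random weights, for demonstration only."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


rng = random.Random(0)
dim, num_experts = 4, 8
router_weights = random_matrix(rng, num_experts, dim)
experts = [random_matrix(rng, dim, dim) for _ in range(num_experts)]
token = [rng.uniform(-1, 1) for _ in range(dim)]

output, chosen, gates = moe_forward(token, router_weights, experts, k=2)
print("experts used:", chosen)
print("gate weights:", gates)
```

With `k=2` and 8 experts, only two of the eight expert matrices are multiplied for this token, which is where the compute saving comes from. In a trained model the router weights and expert weights would be learned together by backpropagation through the weighted sum.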

