This post explores the findings of the paper 'Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer' and its implementation in Mixtral. It discusses the concept of token-level mixture of experts, the use of sparse matrices in the gating function, and the optimization of the loss function to balance expert usage.
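To make the routing idea concrete, here is a minimal PyTorch sketch of a token-level, top-k sparsely gated mixture-of-experts layer with a simplified load-balancing term. The names (SparseMoE, num_experts, top_k) and the auxiliary loss are illustrative assumptions for this post, not code from the paper or from Mixtral.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SparseMoE(nn.Module):
    """Illustrative top-k sparsely gated MoE layer (not a reference implementation)."""

    def __init__(self, dim, num_experts=8, top_k=2):
        super().__init__()
        self.top_k = top_k
        self.gate = nn.Linear(dim, num_experts, bias=False)  # gating function / router
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                       # x: (tokens, dim)
        logits = self.gate(x)                   # (tokens, num_experts)
        weights, idx = logits.topk(self.top_k, dim=-1)
        weights = F.softmax(weights, dim=-1)    # renormalize over the selected experts
        out = torch.zeros_like(x)
        for k in range(self.top_k):             # each token is processed by only top_k experts
            for e, expert in enumerate(self.experts):
                mask = idx[:, k] == e
                if mask.any():
                    out[mask] += weights[mask, k, None] * expert(x[mask])
        # Simplified auxiliary term in the spirit of the paper's importance loss:
        # squared coefficient of variation of per-expert router probability mass.
        importance = F.softmax(logits, dim=-1).sum(dim=0)          # (num_experts,)
        aux_loss = importance.var() / (importance.mean() ** 2 + 1e-9)
        return out, aux_loss

# Example usage: route 10 tokens of width 64 through the layer.
moe = SparseMoE(dim=64)
y, aux = moe(torch.randn(10, 64))
```

Adding `aux_loss` (scaled by a small coefficient) to the main training loss is the mechanism the post refers to for discouraging the router from collapsing onto a few favorite experts.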

6 min read · From towardsdatascience.com
Table of contents
- Token-Level Mixture of Experts
- Conditional Computation & Sparsely Gated Mixture of Experts
- Gating Function
- Optimizing the Loss Function to Balance Expert Usage
- Getting Enough Training Data to the Experts
- Mixtral’s Implementation and Grok
- Closing Thoughts
