This post explores the findings of the 'Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer' paper and its implementation in Mixtral. It discusses the concept of token-level mixture of experts, the use of sparse matrices in the gating function, and the balancing of expert usage through the loss function.
Table of contents
- Token-Level Mixture of Experts
- Conditional Computation & Sparsely Gated Mixture of Experts
- Gating Function
- Optimizing the Loss Function to Balance Expert Usage
- Getting Enough Training Data to the Experts
- Mixtral’s Implementation and Grok
- Closing Thoughts