Mixture-of-experts (MoE) layers are used to improve the performance of transformer models, particularly large language models (LLMs). An MoE layer replaces a dense feed-forward layer with a set of sparse "expert" feed-forward networks and a router that assigns tokens to experts, typically via a softmax gating function. MoE models are popular for LLMs because they increase model capacity without a proportional increase in computational cost: only a small subset of experts is activated for each token during inference.
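As a rough sketch of this routing mechanism, the toy NumPy function below (names like `moe_layer` and the single-matrix ReLU experts are illustrative assumptions, not any particular model's implementation) applies a softmax gate, keeps only the top-k experts per token, and mixes their outputs by the renormalized gate weights:

```python
import numpy as np

def moe_layer(x, expert_weights, gate_weights, top_k=2):
    """Route each token to its top-k experts via softmax gating.

    x: (num_tokens, d_model) token representations
    expert_weights: list of (d_model, d_model) matrices, one per expert
    gate_weights: (d_model, num_experts) router matrix
    """
    logits = x @ gate_weights                      # (num_tokens, num_experts)
    probs = np.exp(logits - logits.max(axis=-1, keepdims=True))
    probs /= probs.sum(axis=-1, keepdims=True)     # softmax over experts

    output = np.zeros_like(x)
    for t in range(x.shape[0]):
        top = np.argsort(probs[t])[-top_k:]            # top-k expert indices
        weights = probs[t][top] / probs[t][top].sum()  # renormalize gate scores
        for e, w in zip(top, weights):
            # Only the selected experts run -- this sparsity is what keeps
            # compute roughly constant as the number of experts grows.
            output[t] += w * np.maximum(x[t] @ expert_weights[e], 0.0)
    return output
```

With `top_k=2` out of, say, 8 experts, each token pays the cost of two expert forward passes while the model holds the parameters of all eight, which is the capacity-versus-compute trade-off described above.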