A hands-on guide to fine-tuning Mixture of Experts (MoE) models using fms-hf-tuning on Red Hat OpenShift AI. It covers two key optimizations: MoE scatter-mode kernels (fast_moe), which save memory and raise throughput, and expert parallelism, which distributes experts across GPUs. The guide walks through full fine-tuning and LoRA workflows for the IBM Granite 4.0 tiny and small hybrid MoE models using Kubeflow Trainer v2, including dataset preparation, training job configuration with FSDPv2, LoRA adapter merging, and serving the fine-tuned model via vLLM on OpenShift AI. It reports a throughput of 590 tokens/GPU/second on 4x A100 GPUs.

Table of contents
Open source tuning with fms-hf-tuning
Mixture of experts kernels, and expert parallelism
MoE kernels
Expert parallelism
Tuning Granite 4 tiny and small MoE hybrid models
Prerequisites
Prepare the dataset
Running a training job
Running a training job: Full fine-tuning
Running a training job: LoRA
Serve the fine-tuned Granite 4.0 models
Try fms-hf-tuning