FP8 training requires sophisticated scaling strategies to maintain numerical stability and accuracy. Per-tensor scaling assigns a unique scaling factor to each tensor: delayed scaling derives the factor from a history of past amax values, while current scaling adapts it in real time from the tensor being quantized. Per-block scaling divides tensors into smaller segments, each with its own scaling factor.
Table of contents
- Per-tensor scaling
- Per-tensor delayed scaling
- Per-tensor current scaling
- Per-block scaling
- What is Micro-Scaling FP8?
- How does MXFP8 work?
- Block scaling
- How does block scaling work?
- Recipes in the NVIDIA NeMo Framework
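To make the distinction between these strategies concrete, here is a minimal PyTorch sketch of how each scale factor could be computed. This is an illustration, not the Transformer Engine or NeMo API: the function names, the `FP8_E4M3_MAX` constant, and the block size are assumptions chosen for clarity.

```python
import torch

# Assumption: quantizing to FP8 E4M3, whose maximum representable magnitude is 448.
FP8_E4M3_MAX = 448.0

def current_scale(tensor: torch.Tensor) -> torch.Tensor:
    """Per-tensor current scaling: derive the scale from this tensor's own amax,
    computed in real time just before quantization."""
    amax = tensor.abs().max()
    return FP8_E4M3_MAX / torch.clamp(amax, min=1e-12)

def delayed_scale(amax_history: torch.Tensor) -> torch.Tensor:
    """Per-tensor delayed scaling: derive the scale from a rolling history of amax
    values recorded in previous iterations, avoiding an extra pass over the
    current tensor."""
    amax = amax_history.max()
    return FP8_E4M3_MAX / torch.clamp(amax, min=1e-12)

def per_block_scales(tensor: torch.Tensor, block: int = 32) -> torch.Tensor:
    """Per-block scaling: one scale per contiguous block of `block` elements,
    so a single outlier only degrades the precision of its own block."""
    flat = tensor.flatten()
    pad = (-flat.numel()) % block          # pad so the length divides evenly
    flat = torch.nn.functional.pad(flat, (0, pad))
    amax = flat.view(-1, block).abs().amax(dim=1)
    return FP8_E4M3_MAX / torch.clamp(amax, min=1e-12)
```

In this sketch, the trade-off is visible directly: current scaling reads the live tensor (accurate but adds a reduction), delayed scaling reuses stored history (cheap but can lag behind sudden outliers), and per-block scaling pays for many small scale factors in exchange for finer-grained dynamic range.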