Dynamic Context Parallelism (Dynamic-CP) is a scheduling approach in NVIDIA Megatron Core that addresses computational inefficiencies in training large language models and diffusion transformers with variable-length sequences. Unlike static context parallelism, which fixes the CP size according to the longest sequence in a batch, Dynamic-CP adaptively selects the CP size for each microbatch based on sequence-packing strategies. The system uses a solver that models compute and communication costs to jointly optimize packing and CP sizing while respecting GPU memory constraints. Framework modifications include building multiple CP groups per rank, dynamic rescheduling with the THD layout, and asynchronous solver execution to avoid adding training overhead. Benchmarks show a 1.48x speedup on GitHub datasets and over 35% improvement in industrial multi-GPU environments, achieved by reducing data-parallel imbalance and unnecessary communication overhead.
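To make the idea concrete, the following is a minimal, hypothetical sketch of the per-microbatch decision: a toy cost model estimates compute (which shrinks as CP size grows) and communication (which grows with CP size) for a packed sequence, then the memory-feasible CP size with the lowest modeled cost is chosen. The names and constants here (`CostModel`, `choose_cp_size`, the candidate CP sizes) are illustrative assumptions, not Megatron Core APIs or the actual solver.

```python
# Hypothetical sketch of per-microbatch CP-size selection.
# Cost constants and the linear/quadratic scaling terms are assumptions,
# not the cost model described in the article.
from dataclasses import dataclass


@dataclass
class CostModel:
    flops_per_token_sq: float = 1.0   # attention compute grows ~quadratically with packed length
    comm_per_token: float = 0.05      # per-token communication cost per extra CP peer (assumed)
    mem_per_token: float = 1.0        # activation memory per token on one rank (assumed units)
    mem_budget: float = 4096.0        # per-GPU activation budget (assumed units)

    def step_cost(self, packed_len: int, cp_size: int) -> float:
        """Estimated time for one packed microbatch at a given CP size."""
        compute = self.flops_per_token_sq * packed_len ** 2 / cp_size
        comm = self.comm_per_token * packed_len * (cp_size - 1)
        return compute + comm

    def fits(self, packed_len: int, cp_size: int) -> bool:
        """Check the per-rank activation-memory constraint."""
        return self.mem_per_token * packed_len / cp_size <= self.mem_budget


def choose_cp_size(packed_len: int, model: CostModel, candidates=(1, 2, 4, 8)) -> int:
    """Pick the memory-feasible CP size with the lowest modeled cost."""
    feasible = [cp for cp in candidates if model.fits(packed_len, cp)]
    if not feasible:
        raise ValueError(f"No CP size in {candidates} fits packed length {packed_len}")
    return min(feasible, key=lambda cp: model.step_cost(packed_len, cp))


if __name__ == "__main__":
    model = CostModel()
    for length in (1024, 8192, 32768):
        print(f"packed length {length:>6} -> CP size {choose_cp_size(length, model)}")
```

In this toy version, short packed sequences stay at CP size 1 (no communication overhead), while longer ones are forced to larger CP sizes by the memory constraint, mirroring the trade-off the actual solver balances across microbatches.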
Table of contents
- Megatron Core framework modifications for supporting Dynamic-CP
- Data scheduler modeling
- Collaboration of cost model, solver, and simulator
- Modeling process and bi-objective balance
- Zero-overhead execution
- Benchmark results
- Learn more