AutoSP is a compiler-based solution built on top of DeepSpeed's DeepCompile ecosystem that automatically converts single-GPU transformer training code into multi-GPU sequence-parallel code for long-context LLM training. It implements DeepSpeed-Ulysses as its sequence-parallel (SP) strategy and introduces Sequence-aware Activation Checkpointing (SAC) to handle memory pressure at sequence lengths of 100k+ tokens. Users enable it by adding a few lines to their DeepSpeed config and calling a utility function to tag inputs; no invasive code changes are required. Benchmarks on Llama 3.1 models on 8×A100-80GB show a higher maximum trainable sequence length with minimal runtime overhead compared to hand-written baselines such as RingFlashAttention and ZeRO-3. The key limitations are that the entire model must be compiled as a single artifact and that graph breaks are not supported.
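Concretely, the enablement flow could look something like the sketch below. This is an illustrative approximation, not the post's verbatim API: the `sequence_parallel_size` config key and the `tag_sequence_inputs` helper are hypothetical stand-ins for the "few lines of config" and the input-tagging utility the summary mentions, while the `compile`/`deepcompile` section reflects how DeepCompile is generally switched on in a DeepSpeed config.

```python
# Illustrative sketch only: the AutoSP-specific config key and the tagging
# helper below are assumptions; see the post for the actual interface.
import torch
import torch.nn as nn
import torch.nn.functional as F
import deepspeed

class TinyTransformer(nn.Module):
    """Stand-in for unmodified single-GPU transformer training code."""
    def __init__(self, vocab=32000, dim=256):
        super().__init__()
        self.emb = nn.Embedding(vocab, dim)
        block = nn.TransformerEncoderLayer(d_model=dim, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(block, num_layers=2)
        self.head = nn.Linear(dim, vocab)

    def forward(self, input_ids):
        return self.head(self.encoder(self.emb(input_ids)))

def tag_sequence_inputs(t: torch.Tensor, seq_dim: int) -> torch.Tensor:
    """Hypothetical stub for the input-tagging utility the post refers to.
    The real helper ships with DeepSpeed; conceptually it records which
    dimension the compiler should partition across SP ranks."""
    t._autosp_seq_dim = seq_dim  # placeholder tag; real API differs
    return t

ds_config = {
    "train_micro_batch_size_per_gpu": 1,
    "bf16": {"enabled": True},
    "optimizer": {"type": "AdamW", "params": {"lr": 1e-5}},
    "zero_optimization": {"stage": 3},
    "compile": {
        "deepcompile": True,          # enables the DeepCompile pipeline
        "sequence_parallel_size": 8,  # assumed key: Ulysses SP degree
    },
}

model = TinyTransformer()
engine, _, _, _ = deepspeed.initialize(
    model=model, model_parameters=model.parameters(), config=ds_config
)
engine.compile()  # hand the module to the compiler passes

seq_len = 4096  # placeholder; AutoSP targets 100k+ tokens once sharded
input_ids = torch.randint(0, 32000, (1, seq_len), device=engine.device)
input_ids = tag_sequence_inputs(input_ids, seq_dim=1)

logits = engine(input_ids)
# Dummy loss for illustration (labels = inputs, not a real LM objective).
loss = F.cross_entropy(logits.view(-1, 32000), input_ids.view(-1))
engine.backward(loss)
engine.step()
```

In this workflow the training script itself stays single-GPU-shaped; launching it across the 8 ranks (e.g. with the `deepspeed` launcher) and letting the compiler passes insert the Ulysses all-to-all communication is what turns it into sequence-parallel training.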

From pytorch.org · 6 min read
Table of contents:
- AutoSP Usage
- AutoSP Compiler Passes
- Evaluating AutoSP on Real Models
- Limitations
- Conclusion
