PyTorch

MXFP8 and NVFP4, the microscaling quantization formats natively supported by NVIDIA's Blackwell architecture (B200 GPUs), deliver significant inference speedups for diffusion models. In benchmarks with Diffusers and TorchAO on Flux.1-Dev, QwenImage, and LTX-2, MXFP8 runs up to 1.26x faster and NVFP4 up to 1.68x faster than BF16 baselines, and peak memory drops by up to roughly 3.5x. The post covers selective quantization strategies (skipping small or accuracy-critical layers), CUDA Graphs enabled through torch.compile's reduce-overhead mode to cut CPU overhead at small batch sizes, and LPIPS as a perceptual accuracy metric. Code and reproducible recipes are provided for all three models.
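To make "microscaling" concrete, here is a minimal pure-Python sketch of block-scaled 4-bit quantization in the spirit of NVFP4. It is an illustration under simplifying assumptions, not the TorchAO or Blackwell implementation: the helper names `quantize_block` and `fake_quantize` are hypothetical, the per-block scale is kept as a plain Python float (real NVFP4 stores it in FP8 alongside a per-tensor scale, and MXFP8 uses a power-of-two scale shared by 32 FP8 elements), and rounding is simple nearest-value rather than hardware rounding.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float, the NVFP4 element type.
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale per 16 elements


def quantize_block(block):
    """Return (scale, codes) for one block; codes are signed E2M1 values.

    Simplification: the scale stays a full-precision float here.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest E2M1 value.
    scale = amax / E2M1_VALUES[-1]
    codes = []
    for x in block:
        target = abs(x) / scale
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - target))
        codes.append(math.copysign(nearest, x))
    return scale, codes


def fake_quantize(values, block_size=BLOCK_SIZE):
    """Quantize and immediately dequantize a flat list, block by block."""
    out = []
    for start in range(0, len(values), block_size):
        scale, codes = quantize_block(values[start:start + block_size])
        out.extend(scale * c for c in codes)
    return out
```

Because each block carries its own scale, a block of small values next to a block of large values keeps its relative precision, which is the main appeal of microscaling over a single per-tensor scale. In this sketch the worst-case absolute error within a block is one sixth of that block's largest magnitude, since the widest gap between adjacent E2M1 values is 2.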

Faster Diffusion on Blackwell: MXFP8 and NVFP4 with Diffusers and TorchAO – PyTorch

Optimizing Accuracy and Performance with Selective Quantization