PyTorch has released the ExecuTorch MLX Delegate, an experimental backend that runs GPU-accelerated inference for PyTorch models on Apple Silicon Macs using Apple's MLX framework. It integrates with the PyTorch 2 export stack via torch.export and supports BF16, FP16, and FP32 dtypes as well as 2-, 4-, and 8-bit affine quantization via TorchAO. The delegate delivers 3-6x higher throughput than existing ExecuTorch backends on macOS. Validated models include dense transformers (Llama 3.2, Qwen 3, Gemma 3, Phi-4 mini), a sparse Mixture-of-Experts model (Qwen 3.5 35B-A3B), and speech-to-text models (Whisper, Parakeet, Voxtral) for both offline and real-time transcription. The workflow follows the standard ExecuTorch pipeline: export the model with torch.export, lower it with MLXPartitioner, and run the resulting .pte file.
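
The three-step workflow above can be sketched roughly as follows. This is a minimal illustration, not the delegate's documented recipe: the helper name export_to_pte and its parameters are hypothetical, and the partitioner is passed in as an argument because the import path and constructor arguments of MLXPartitioner are not given in this summary. The calls torch.export.export and executorch.exir.to_edge_transform_and_lower are the general ExecuTorch export entry points; it is an assumption that the MLX delegate is used through them unchanged.

```python
from pathlib import Path


def export_to_pte(model, example_inputs, partitioner, out_path):
    """Export `model`, lower it with `partitioner`, and write a .pte file.

    Hypothetical helper for illustration. `example_inputs` is a tuple of
    example tensors; `partitioner` is an already-constructed partitioner
    object (for the MLX delegate, an MLXPartitioner instance).
    """
    # Imported lazily so this file can be loaded on machines that do not
    # have torch or executorch installed.
    import torch
    from executorch.exir import to_edge_transform_and_lower

    # Step 1: capture the model as an ExportedProgram.
    exported = torch.export.export(model.eval(), example_inputs)

    # Step 2: lower to the Edge dialect. The partitioner selects which
    # subgraphs are handed to the delegate backend.
    edge = to_edge_transform_and_lower(exported, partitioner=[partitioner])

    # Step 3: serialize the ExecuTorch program to disk.
    program = edge.to_executorch()
    out = Path(out_path)
    out.write_bytes(program.buffer)
    return out
```

A call would look something like export_to_pte(model, (example_tensor,), partitioner, "model.pte"), with the partitioner built as described in the ExecuTorch MLX documentation. Passing the partitioner in keeps the sketch independent of one backend, so the same helper could be pointed at a different ExecuTorch delegate for comparison.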

5m read time · From pytorch.org
Table of contents
What is the MLX Delegate?
Why Build This as an ExecuTorch Delegate?
Quantization and Dtype Support
What Models Can I Run?
