Arm SME2 (Scalable Matrix Extension 2) delivers up to a 3.9x speedup for on-device ML inference when running image segmentation models like SqueezeSAM through PyTorch's ExecuTorch runtime. On a single CPU core, INT8 inference improves by 1.83x (556ms to 304ms) and FP16 by 3.9x (1,163ms to 298ms), making FP16 effectively as fast as INT8.
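
For orientation before the detailed sections, here is a minimal sketch of the kind of export flow this stack implies: capture a PyTorch model, lower supported operators to the XNNPACK backend, and serialize an ExecuTorch program. The toy model, input shape, and file name are placeholders rather than the article's exact SqueezeSAM setup; on SME2-capable hardware, XNNPACK's KleidiAI microkernels are picked up at runtime without model changes.

```python
# Minimal ExecuTorch export sketch (illustrative; the model and shapes are
# stand-ins, not the article's SqueezeSAM configuration).
import torch
from executorch.backends.xnnpack.partition.xnnpack_partitioner import XnnpackPartitioner
from executorch.exir import to_edge_transform_and_lower

# Placeholder segmentation-style model.
model = torch.nn.Sequential(
    torch.nn.Conv2d(3, 16, 3, padding=1),
    torch.nn.ReLU(),
).eval()
example_inputs = (torch.randn(1, 3, 224, 224),)

# Capture the graph, lower supported ops to XNNPACK, and build the program.
exported = torch.export.export(model, example_inputs)
program = to_edge_transform_and_lower(
    exported, partitioner=[XnnpackPartitioner()]
).to_executorch()

# Serialize the .pte file consumed by the on-device ExecuTorch runtime.
with open("model.pte", "wb") as f:
    f.write(program.buffer)
```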

Table of contents
- The Stack: PyTorch, ExecuTorch, XNNPACK, Arm KleidiAI, and SME2
- Results: INT8 and FP16 (1 CPU core vs 4 CPU cores)
- Three Insights from End-to-End and Operator-Level Results
- Hands-On Example: Reproducing the Workflow
- Conclusion: What SME2 Changes in Practice
