Arm SME2 (Scalable Matrix Extension 2) delivers up to a 3.9x speedup for on-device ML inference when running image segmentation models like SqueezeSAM through PyTorch's ExecuTorch runtime. On a single CPU core, INT8 inference improves by 1.83x (556ms to 304ms) and FP16 by 3.9x (1,163ms to 298ms), making FP16 nearly as fast as INT8.
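The quoted speedups follow directly from the before/after latencies; a quick sanity check:

```python
# Latencies (ms) on a single CPU core, as quoted above.
int8_before_ms, int8_after_ms = 556, 304
fp16_before_ms, fp16_after_ms = 1163, 298

# Speedup = old latency / new latency.
int8_speedup = int8_before_ms / int8_after_ms
fp16_speedup = fp16_before_ms / fp16_after_ms

print(f"INT8: {int8_speedup:.2f}x, FP16: {fp16_speedup:.2f}x")
# INT8: 1.83x, FP16: 3.90x
```

With SME2 enabled, FP16 (298ms) edges out INT8 (304ms) in absolute latency, which is why the post notes FP16 becomes nearly as fast as INT8.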
Table of contents

- The Stack: PyTorch, ExecuTorch, XNNPACK, Arm KleidiAI, and SME2
- Results: INT8 and FP16 (1 CPU core vs 4 CPU cores)
- Three Insights from End-to-End and Operator-Level Results
- Hands-On Example: Reproducing the Workflow
- Conclusion: What SME2 Changes in Practice