A talk by the developer of Locally AI demonstrating how to run Gemma 4 on iPhone using Apple's MLX framework, reaching 40 tokens per second on the latest iPhones. The talk covers the MLX Swift LM GitHub repo for iOS/macOS integration, sourcing quantized models (4-bit to 8-bit) from the MLX Community on Hugging Face, and practical tips on model selection for on-device inference. It also mentions tool calling support, the broader MLX ecosystem (VLM, audio, video), and the recent acquisition of Locally AI by LM Studio.
•10m watch time