Support for NVIDIA’s NVFP4 format also allows larger models to run under tighter hardware constraints.


The New Stack

Ollama's latest release integrates Apple's MLX framework to accelerate local LLM inference on Apple Silicon Macs, using the memory shared between CPU and GPU to reduce latency and improve throughput. The update also adds support for NVIDIA's NVFP4 low-precision format, which lets larger models run under tighter memory constraints. For now, MLX support is limited to the Qwen3.5-35B-A3B model, with more models expected. The release comes amid growing demand for local AI agents such as OpenClaw: running models locally offers data control and cost savings, though typically at slower speeds than remote APIs.
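To make the memory claim concrete, here is a rough back-of-the-envelope sketch of weight storage at different precisions. It is not taken from Ollama or the article: the 35-billion parameter count is read off the model name, the 4.5 bits per weight figure assumes NVFP4 stores 4-bit values plus one 8-bit scale per 16-element block, and the helper name `estimate_weight_gb` is hypothetical. Activations, KV cache, and runtime overhead are ignored.

```python
# Rough estimate of memory needed for model weights alone.
# Assumptions (not from the article): 35e9 parameters, and NVFP4 costing
# 4 bits per value plus one 8-bit scale shared by each block of 16 values.

def estimate_weight_gb(num_params: float, bits_per_weight: float) -> float:
    """Return approximate weight storage in gigabytes (1 GB = 1e9 bytes)."""
    total_bytes = num_params * bits_per_weight / 8
    return total_bytes / 1e9

NUM_PARAMS = 35e9                 # assumed from the "35B" in the model name
FP16_BITS = 16.0                  # half-precision baseline
NVFP4_BITS = 4.0 + 8.0 / 16.0     # 4-bit value + amortized block scale = 4.5

if __name__ == "__main__":
    for label, bits in [("FP16", FP16_BITS), ("NVFP4 (assumed)", NVFP4_BITS)]:
        print(f"{label}: {estimate_weight_gb(NUM_PARAMS, bits):.1f} GB")
```

Under these assumptions, the weights shrink from roughly 70 GB at 16-bit precision to about 20 GB, which is the kind of reduction that brings a model of this size within reach of a single workstation-class device.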

Ollama taps Apple’s MLX framework to make local AI models faster on Macs

OpenClaw and the shift toward local agents and models