Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. It runs models larger than physical memory by placing tensors intelligently across GPU, RAM, and NVMe tiers, and supports three inference modes:

- **Full-resident**: the model fits entirely in GPU + RAM.
- **Expert-streaming**: for MoE models such as Mixtral, exploiting routing sparsity (only 2 of 8 experts fire per token) to achieve a 99.5% neuron cache hit rate.
- **Dense FFN-streaming**: for large dense models such as Llama 70B.

On an M1 Max with 32 GB, Hypura runs a 31 GB Mixtral at 2.2 tok/s and a 40 GB Llama 70B at 0.3 tok/s; both crash vanilla llama.cpp with an OOM. It exposes an Ollama-compatible HTTP API, builds with Cargo (Rust), and requires no manual tuning: pool sizes and prefetch depth are computed automatically from hardware profiling.
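The tier-placement idea can be pictured as a greedy pass over the model's tensors: sort by expected access frequency, fill the fastest tier first, and leave everything else on NVMe to be streamed on demand. The sketch below is a minimal, hypothetical illustration of that idea, not Hypura's actual code; the `Tier`, `Tensor`, and `place_tensors` names, the heat scores, and the byte budgets are all assumptions made for the example.

```rust
// Hypothetical sketch of greedy storage-tier placement.
// All names, heat scores, and budgets here are illustrative,
// not Hypura's real API.

#[derive(Debug, Clone, Copy)]
enum Tier {
    Gpu,  // unified-memory GPU working set
    Ram,  // CPU-side cache
    Nvme, // weights left on disk, streamed on demand
}

struct Tensor {
    name: &'static str,
    bytes: u64,
    heat: f64, // expected accesses per token (e.g. ~1.0 for attention, lower for cold experts)
}

/// Greedy placement: hottest tensors first, fill GPU, then RAM;
/// whatever doesn't fit stays on NVMe.
fn place_tensors(
    tensors: &mut [Tensor],
    gpu_budget: u64,
    ram_budget: u64,
) -> Vec<(&'static str, Tier)> {
    tensors.sort_by(|a, b| b.heat.partial_cmp(&a.heat).unwrap());
    let (mut gpu_used, mut ram_used) = (0u64, 0u64);
    tensors
        .iter()
        .map(|t| {
            let tier = if gpu_used + t.bytes <= gpu_budget {
                gpu_used += t.bytes;
                Tier::Gpu
            } else if ram_used + t.bytes <= ram_budget {
                ram_used += t.bytes;
                Tier::Ram
            } else {
                Tier::Nvme
            };
            (t.name, tier)
        })
        .collect()
}

fn main() {
    let mut tensors = vec![
        Tensor { name: "attn.q_proj", bytes: 1 << 30, heat: 1.0 },
        Tensor { name: "expert.3.ffn", bytes: 3 << 30, heat: 0.25 },
        Tensor { name: "expert.7.ffn", bytes: 3 << 30, heat: 0.05 },
    ];
    // Example budgets: 2 GiB of GPU working set, 3 GiB of RAM cache.
    for (name, tier) in place_tensors(&mut tensors, 2 << 30, 3 << 30) {
        println!("{name:>14} -> {tier:?}");
    }
}
```

A real scheduler would also weigh prefetch depth and tier bandwidth rather than bytes alone, which is presumably where the hardware profiling mentioned above comes in.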
Table of contents

- Why does this matter?
- How it works
- Performance
- Install
- Quick start
- Ollama-compatible server
- Architecture
- FAQ
- Safety notes
- License
- Ethics