Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon. It runs models larger than physical memory by placing tensors intelligently across GPU, RAM, and NVMe tiers, and supports three inference modes:

- **Full-resident**: the model fits entirely in GPU + RAM.
- **Expert-streaming**: for MoE models such as Mixtral, exploiting routing sparsity (only 2 of 8 experts fire per token) to achieve a 99.5% neuron cache hit rate.
- **Dense FFN-streaming**: for large dense models such as Llama 70B.

On an M1 Max with 32 GB, Hypura runs a 31 GB Mixtral at 2.2 tok/s and a 40 GB Llama 70B at 0.3 tok/s; both crash vanilla llama.cpp with an OOM. It exposes an Ollama-compatible HTTP API, builds with Cargo (Rust), and requires no manual tuning: pool sizes and prefetch depth are computed automatically from hardware profiling.
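The tier-placement idea can be pictured as a greedy pass over the model's tensors: sort by expected access frequency, fill the fastest tier first, and leave everything else on NVMe to be streamed on demand. The sketch below is a minimal, hypothetical illustration of that idea, not Hypura's actual code; the `Tier`, `Tensor`, and `place_tensors` names, the heat scores, and the byte budgets are all assumptions made for the example.

```rust
// Hypothetical sketch of greedy storage-tier placement.
// All names, heat scores, and budgets here are illustrative,
// not Hypura's real API.

#[derive(Debug, Clone, Copy)]
enum Tier {
    Gpu,  // unified-memory GPU working set
    Ram,  // CPU-side cache
    Nvme, // weights left on disk, streamed on demand
}

struct Tensor {
    name: &'static str,
    bytes: u64,
    heat: f64, // expected accesses per token (e.g. ~1.0 for attention, lower for cold experts)
}

/// Greedy placement: hottest tensors first, fill GPU, then RAM;
/// whatever doesn't fit stays on NVMe.
fn place_tensors(
    tensors: &mut [Tensor],
    gpu_budget: u64,
    ram_budget: u64,
) -> Vec<(&'static str, Tier)> {
    tensors.sort_by(|a, b| b.heat.partial_cmp(&a.heat).unwrap());
    let (mut gpu_used, mut ram_used) = (0u64, 0u64);
    tensors
        .iter()
        .map(|t| {
            let tier = if gpu_used + t.bytes <= gpu_budget {
                gpu_used += t.bytes;
                Tier::Gpu
            } else if ram_used + t.bytes <= ram_budget {
                ram_used += t.bytes;
                Tier::Ram
            } else {
                Tier::Nvme
            };
            (t.name, tier)
        })
        .collect()
}

fn main() {
    let mut tensors = vec![
        Tensor { name: "attn.q_proj", bytes: 1 << 30, heat: 1.0 },
        Tensor { name: "expert.3.ffn", bytes: 3 << 30, heat: 0.25 },
        Tensor { name: "expert.7.ffn", bytes: 3 << 30, heat: 0.05 },
    ];
    // Example budgets: 2 GiB of GPU working set, 3 GiB of RAM cache.
    for (name, tier) in place_tensors(&mut tensors, 2 << 30, 3 << 30) {
        println!("{name:>14} -> {tier:?}");
    }
}
```

A real scheduler would also weigh prefetch depth and tier bandwidth rather than bytes alone, which is presumably where the hardware profiling mentioned above comes in.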
Table of contents

- Why does this matter?
- How it works
- Performance
- Install
- Quick start
- Ollama-compatible server
- Architecture
- FAQ
- Safety notes
- License
- Ethics