Hypura is a storage-tier-aware LLM inference scheduler for Apple Silicon that enables running models larger than physical memory by intelligently placing tensors across GPU, RAM, and NVMe tiers. It supports three inference modes: full-resident (model fits in GPU+RAM), expert-streaming for MoE models like Mixtral (exploiting
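The tier-aware placement idea above can be sketched as a greedy assignment: order tensors by expected access frequency and put the hottest ones on the fastest tier that still has room. This is an illustrative sketch only, assuming a simple hotness heuristic and made-up tier capacities; the names (`Tensor`, `place`, `TIERS`) are hypothetical and not Hypura's actual API.

```python
from dataclasses import dataclass

# Tiers ordered fastest to slowest (illustrative assumption).
TIERS = ["gpu", "ram", "nvme"]


@dataclass
class Tensor:
    name: str
    size: int       # bytes
    hotness: float  # expected access frequency; higher = accessed more often


def place(tensors, capacities):
    """Greedily place each tensor on the fastest tier with free space,
    visiting tensors from hottest to coldest."""
    free = dict(capacities)
    placement = {}
    for t in sorted(tensors, key=lambda t: t.hotness, reverse=True):
        for tier in TIERS:
            if free[tier] >= t.size:
                placement[t.name] = tier
                free[tier] -= t.size
                break
        else:
            raise MemoryError(f"no tier can hold {t.name}")
    return placement


tensors = [
    Tensor("embeddings", size=4, hotness=0.9),
    Tensor("attention", size=4, hotness=0.8),
    Tensor("expert_0", size=8, hotness=0.2),  # cold MoE expert
]
capacities = {"gpu": 8, "ram": 8, "nvme": 100}
print(place(tensors, capacities))
# hot tensors land on the GPU; the cold expert spills to RAM
```

A real scheduler would also weigh transfer bandwidth between tiers and prefetch experts predicted by the router, but the greedy skeleton captures the core placement decision.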

Table of contents

- Why does this matter?
- How it works
- Performance
- Install
- Quick start
- Ollama-compatible server
- Architecture
- FAQ
- Safety notes
- License
- Ethics