Apple researchers propose a method for running large language models on devices whose DRAM cannot hold the full model, by loading model weights from flash memory on demand. Key techniques include: loading only the feed-forward network weights predicted to activate (attention weights stay resident in DRAM), a 'windowing' approach that reuses neurons activated by recent tokens across a sliding input window to minimize flash transfers, and row-column 'bundling' that stores each neuron's up- and down-projection weights contiguously in flash to increase read chunk sizes and throughput. Results show over a 20x speedup on GPU compared to naively loading weights from flash, bringing single inference latency under 100ms. This work is a step toward running high-end LLMs on memory-constrained devices like smartphones.
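The paper doesn't ship reference code, but a minimal Python sketch can illustrate how windowing and bundling fit together. Everything here is an assumption for illustration: the class name `WindowedNeuronCache`, the sizes, and the NumPy array standing in for flash storage are all hypothetical, and the activation predictor is replaced by caller-supplied neuron ids.

```python
import numpy as np

# Illustrative sizes (assumptions, not the paper's configuration).
D_MODEL, D_FF, WINDOW = 32, 128, 5

# Row-column bundling: store the i-th column of the up-projection next
# to the i-th row of the down-projection, so a single sequential flash
# read fetches all the weights belonging to neuron i.
rng = np.random.default_rng(0)
w_up = rng.standard_normal((D_MODEL, D_FF))
w_down = rng.standard_normal((D_FF, D_MODEL))
flash = np.concatenate([w_up.T, w_down], axis=1)  # shape (D_FF, 2*D_MODEL)


class WindowedNeuronCache:
    """DRAM cache of the FFN neurons active in the last WINDOW tokens."""

    def __init__(self, flash_rows):
        self.flash = flash_rows  # stand-in for flash-resident bundles
        self.window = []         # one set of active neuron ids per token
        self.cache = {}          # neuron id -> bundled weights in DRAM

    def step(self, active_ids):
        """Advance one token; return the weights for its active neurons."""
        self.window.append(set(active_ids))
        if len(self.window) > WINDOW:
            expired = self.window.pop(0)
            still_needed = set().union(*self.window)
            for nid in expired - still_needed:
                self.cache.pop(nid, None)  # evict neurons no one needs
        # Only neurons not already resident trigger a (simulated) flash
        # read -- these are the transfers windowing is meant to minimize.
        for nid in active_ids:
            if nid not in self.cache:
                self.cache[nid] = self.flash[nid]
        return np.stack([self.cache[nid] for nid in active_ids])


cache = WindowedNeuronCache(flash)
for token_active in ([1, 2, 3], [2, 3, 4], [3, 4, 99]):
    bundles = cache.step(token_active)  # shape (n_active, 2*D_MODEL)
```

The two ideas are complementary: bundling roughly doubles the useful bytes per flash read because a neuron's up- and down-projection weights travel together, while the sliding window bounds DRAM use to the neurons active in the last few tokens and makes most reads cache hits.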