Local LLMs have come so far that you can now run one on your phone.

XDA Developers

Google's Gemma 4 E4B model can run locally on a modern Android phone using llama.cpp via Termux, turning the device into an OpenAI-compatible LLM server. On an Oppo Find N5 with Snapdragon 8 Elite and 16GB RAM, the setup achieves 7-8 tokens/second with sub-second first-token latency. The configuration supports vision (image description), audio transcription, and tool calls, making it usable for smart home automation tasks like package detection via Home Assistant. Key setup details include using a Q4_K_M quant for the main model and a BF16 multimodal projector — quantizing the mmproj degrades vision/audio quality. While not a replacement for a desktop GPU setup, it demonstrates how far on-device inference has come.

I turned my phone into a local LLM server, and it handles vision, voice, and tool calls