Google's Gemma 4 E4B model can run locally on a modern Android phone using llama.cpp via Termux, turning the device into an OpenAI-compatible LLM server. On an Oppo Find N5 with Snapdragon 8 Elite and 16GB RAM, the setup achieves 7-8 tokens/second with sub-second first-token latency. The configuration supports vision (image description), audio transcription, and tool calls, making it usable for smart home automation tasks like package detection via Home Assistant. Key setup details include using a Q4_K_M quant for the main model and a BF16 multimodal projector — quantizing the mmproj degrades vision/audio quality. While not a replacement for a desktop GPU setup, it demonstrates how far on-device inference has come.
Table of contents
The phone is doing all the thinkingVision is even better than transcriptionIt's not replacing my local LLMSort: