Apple researchers have introduced Ferret-UI Lite, a compact 3B-parameter multimodal model designed to run on-device and interact with graphical user interfaces across mobile, web, and desktop platforms. Unlike existing GUI agents that rely on large foundation models like GPT or Gemini, Ferret-UI Lite prioritizes low latency, privacy, and offline capability. The model uses screen image cropping, chain-of-thought reasoning, and a two-stage training pipeline combining supervised fine-tuning and reinforcement learning with verifiable rewards (RLVR). It achieves 91.6% on the ScreenSpot-V2 GUI grounding benchmark, competitive with much larger models. Limitations include struggles with long-horizon multi-step tasks and sensitivity to reward design. The model could enable Apple to reduce Siri's dependence on Google Cloud.
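To make "verifiable rewards" concrete for GUI grounding: one common way to score a grounding prediction is to check whether the predicted click point lands inside the target element's bounding box. The summary above does not specify the reward Ferret-UI Lite actually uses, so the sketch below is an illustrative assumption, and the names `Box`, `grounding_reward`, and `grounding_accuracy` are hypothetical.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class Box:
    """Axis-aligned bounding box of a UI element, in screen pixels (hypothetical type)."""
    left: float
    top: float
    right: float
    bottom: float


def grounding_reward(pred_x: float, pred_y: float, target: Box) -> float:
    """Return 1.0 if the predicted point lies inside the target box (edges included), else 0.0.

    This is an assumed, simplified reward; it can be checked automatically
    against ground-truth annotations, which is what makes it "verifiable".
    """
    inside = (target.left <= pred_x <= target.right
              and target.top <= pred_y <= target.bottom)
    return 1.0 if inside else 0.0


def grounding_accuracy(preds: Sequence[Tuple[float, float]],
                       targets: Sequence[Box]) -> float:
    """Fraction of predictions that hit their target box."""
    if len(preds) != len(targets):
        raise ValueError("preds and targets must have the same length")
    if not preds:
        return 0.0
    hits = sum(grounding_reward(x, y, box) for (x, y), box in zip(preds, targets))
    return hits / len(preds)
```

Averaging this binary hit test over a set of examples gives a hit rate, the same general kind of number as a grounding benchmark score. A binary signal like this is also easy to get wrong at the margins (for example, how edge pixels or overlapping elements are treated), which is one way sensitivity to reward design can arise.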