Hugging Face introduces Smol2Operator, a comprehensive approach for training lightweight vision-language models to perform GUI automation tasks. The methodology transforms a base model with zero grounding capabilities into an agentic GUI coder through a two-phase training process. Phase 1 establishes GUI grounding using action-instruction pairs, while Phase 2 develops agentic reasoning capabilities. The approach achieves 61% accuracy on the ScreenSpot-v2 benchmark, and the release includes complete open-source training recipes, datasets, preprocessing tools, and the resulting model to enable full reproducibility.
Table of contents
Introduction
1. Data Transformation and Unified Action Space
2. Phase 1: From Zero to Perception
3. Phase 2: From Perception to Cognition
4. All you need is Open Source
5. Conclusion
What's Next?