Hugging Face introduces Smol2Operator, an approach for training lightweight vision-language models to perform GUI automation tasks. The methodology turns a base model with no grounding capability into an agentic GUI coder through a two-phase training process. Phase 1 establishes GUI grounding using action-instruction pairs, and Phase 2 develops agentic reasoning. The approach reaches 61% accuracy on the ScreenSpot-v2 benchmark, and the release includes the open-source training recipes, datasets, preprocessing tools, and resulting model needed for full reproducibility.

12m read time. From huggingface.co
Table of Contents
Introduction
1. Data Transformation and Unified Action Space
2. Phase 1: From Zero to Perception
3. Phase 2: From Perception to Cognition
4. All you need is Open Source
5. Conclusion
What's Next?
