Towards Data Science is a community-powered publication that showcases work in data science, machine learning and artificial intelligence. Every day newcomers, seasoned researchers and industry practitioners publish tutorials, research notes and real-world case studies that help the field move forward.

Visual-Language-Action (VLA) models unify perception, language understanding, and robotic control into a single learned system. The post covers the mathematical foundations of VLAs, including how robots represent observations as latent embeddings, three strategies for generating continuous actions (action tokenization, diffusion-based heads, and flow matching), and real-world architectures from models like RT-2, OpenVLA, GR00T N1, π0, and Figure's Helix 02. It also explains the two-phase training pipeline: large-scale pretraining on diverse robot demonstration datasets followed by embodiment-specific post-training for task specialization and fine motor control.

How Visual-Language-Action (VLA) Models Work