LatentVLA is a novel architecture for autonomous driving that avoids natural language reasoning entirely. Instead, it learns discrete ego-centric latent actions from unlabeled driving data using a self-supervised encoder-decoder framework inspired by LAPO, with a VQ-VAE to discretize continuous action vectors. A Qwen2.5-VL (3.8B) model is trained to predict these latent actions, then distilled into a compact 50M-parameter decision transformer for real-time use. The approach integrates VLM knowledge into existing end-to-end architectures (iPad, Transfuser) via a fusion module using cross-attention in Bird's-Eye-View space. Evaluated on NavSim, LatentVLA achieves state-of-the-art results, though performance gains over baselines are modest. The author notes that open-loop evaluation has significant limitations and argues closed-loop testing would likely reveal larger advantages for reasoning-based approaches.
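
To make the discretization step concrete, here is a minimal sketch of the nearest-codebook lookup at the heart of a VQ-VAE. It illustrates the general technique only, not the article's implementation: the codebook values, the 2-D vector size, and the names `quantize`, `discretize`, and `CODEBOOK` are all assumptions, and a real VQ-VAE learns its codebook during training rather than fixing it by hand.

```python
from typing import List, Sequence

# Hypothetical 2-D codebook with four entries (a trained model would learn these).
CODEBOOK: List[List[float]] = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]


def quantize(vec: Sequence[float], codebook: Sequence[Sequence[float]]) -> int:
    """Return the index of the codebook entry closest to vec (squared L2 distance)."""
    def sq_dist(a: Sequence[float], b: Sequence[float]) -> float:
        return sum((x - y) ** 2 for x, y in zip(a, b))

    return min(range(len(codebook)), key=lambda i: sq_dist(vec, codebook[i]))


def discretize(actions: Sequence[Sequence[float]]) -> List[int]:
    """Map each continuous action vector to a discrete token (its codebook index)."""
    return [quantize(a, CODEBOOK) for a in actions]


# Three made-up continuous action vectors become three discrete tokens.
tokens = discretize([[0.1, -0.2], [0.9, 0.8], [0.2, 0.7]])
print(tokens)  # [0, 3, 2]
```

The resulting integer indices play the role of the discrete latent actions that a downstream model can be trained to predict as tokens.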

8 min read time. From towardsdatascience.com
Table of contents
Latent Action Learning
VLM Training
Knowledge Distillation
Evaluation
The limitations of open-source planning
Conclusion
Thank you for reading this far!
References
