Yann LeCun's $1B Bet Against LLMs [Part 2]
This video examines Yann LeCun's JEPA (Joint Embedding Predictive Architecture) framework as an alternative to the dominant VLA (Vision-Language-Action) approach to AI and robotics. It covers V-JEPA 2, a Meta model trained on 1 million hours of video without language supervision that reaches state-of-the-art results on video understanding benchmarks when paired with a language model. It also explores VL-JEPA, a JEPA-based vision-language model that outperforms 7B-parameter models with only 1.6B parameters by predicting text embeddings rather than raw tokens. The video details LeCun's two main critiques of VLA models: (1) behavioral cloning is brittle and does not scale, and (2) VLAs lack explicit planning and world models. The JEPA alternative learns an action-conditioned world model that enables explicit planning with methods such as the cross-entropy method, demonstrated on the Push-T task. LeCun proposes hierarchical world models as his solution to long-horizon planning. Although JEPA-based approaches show theoretical advantages and promising early results, their demonstrated performance on robotics tasks remains significantly behind current VLA systems such as Physical Intelligence's PI07.
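To make the planning idea concrete, here is a minimal sketch of the cross-entropy method (CEM) searching over action sequences with an action-conditioned world model. Everything in it is an assumption made for illustration, not the video's or Meta's implementation: `world_model` is a hypothetical toy in which the 2D "latent state" simply moves by the action, and `rollout_cost`, `cem_plan`, and all parameter values are invented names and settings. A real JEPA planner would instead use a learned encoder and predictor and measure cost in embedding space.

```python
import random
import statistics


def world_model(state, action):
    """Hypothetical stand-in for a learned action-conditioned predictor.

    Toy dynamics (an assumption): the next state is the state plus the action.
    """
    return (state[0] + action[0], state[1] + action[1])


def rollout_cost(state, actions, goal):
    """Roll the model forward and return squared distance to the goal."""
    for action in actions:
        state = world_model(state, action)
    return (state[0] - goal[0]) ** 2 + (state[1] - goal[1]) ** 2


def cem_plan(state, goal, horizon=5, pop=200, n_elite=20, iters=30, seed=0):
    """Plan an action sequence with CEM using a diagonal Gaussian."""
    rng = random.Random(seed)
    dim = horizon * 2  # two action components per time step
    mean = [0.0] * dim
    std = [1.0] * dim

    def unflatten(flat):
        return [(flat[2 * t], flat[2 * t + 1]) for t in range(horizon)]

    for _ in range(iters):
        # 1. Sample candidate action sequences from the current Gaussian.
        samples = [
            [rng.gauss(m, s) for m, s in zip(mean, std)] for _ in range(pop)
        ]
        # 2. Score each candidate by imagining its outcome with the model.
        samples.sort(key=lambda flat: rollout_cost(state, unflatten(flat), goal))
        # 3. Refit the Gaussian to the lowest-cost (elite) candidates.
        elites = samples[:n_elite]
        mean = [statistics.fmean(col) for col in zip(*elites)]
        std = [statistics.pstdev(col) + 1e-6 for col in zip(*elites)]

    return unflatten(mean)


if __name__ == "__main__":
    start, goal = (0.0, 0.0), (3.0, -2.0)
    plan = cem_plan(start, goal)
    print("planned actions:", plan)
    print("final cost:", rollout_cost(start, plan, goal))
```

The point of the sketch is the contrast with behavioral cloning: no demonstration data is imitated. The planner only queries the model for imagined outcomes and keeps the action sequences whose predicted end state lands closest to the goal.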