V-JEPA (Video Joint-Embedding Predictive Architecture) is Meta AI's self-supervised video model that learns visual representations by predicting masked spatiotemporal regions in feature space rather than pixel space. Building on I-JEPA for images, V-JEPA uses three components: a context encoder, a predictor, and a target encoder whose weights are an exponential moving average of the context encoder's, which prevents representational collapse. Masked targets cover the same spatial region across every frame of the clip (a spatiotemporal "tube"), so the model cannot simply copy features from neighboring frames and must instead learn richer temporal semantics. V-JEPA outperforms comparable models on motion-centric benchmarks (Something-Something-v2) and is competitive on appearance-centric ones (Kinetics-400), demonstrating the effectiveness of feature prediction over pixel prediction for video representation learning.
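To make the training loop concrete, here is a minimal PyTorch sketch of one V-JEPA-style update. The linear layers standing in for the ViT encoders and predictor, the token shapes, and the `tube_mask` helper are all hypothetical simplifications, not Meta's implementation; only the overall structure (context and target encoders, tube masking, L1 regression in feature space, EMA target update) follows the description above.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy dimensions: T frames, S spatial patches per frame, D-dim tokens.
T, S, D = 8, 196, 768

# Stand-ins for the real video ViT backbones and the narrow transformer
# predictor used in the paper (hypothetical modules for illustration).
context_encoder = nn.Linear(D, D)
predictor = nn.Linear(D, D)
target_encoder = copy.deepcopy(context_encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)  # target weights are EMA-updated, never trained

def tube_mask(spatial_ratio: float = 0.5) -> torch.Tensor:
    """Mask the SAME spatial patches in every frame (a spatiotemporal tube)."""
    spatial = torch.rand(S) < spatial_ratio               # (S,) shared across time
    return spatial.unsqueeze(0).expand(T, S).reshape(-1)  # (T*S,)

tokens = torch.randn(1, T * S, D)  # one pre-tokenized video clip (hypothetical)
mask = tube_mask()                 # True = masked, to-be-predicted position
n_masked = int(mask.sum())

# 1) Encode only the visible (context) tokens.
ctx = context_encoder(tokens[:, ~mask])

# 2) Targets are the EMA encoder's features for the masked tokens,
#    computed over the full clip, with a stop-gradient.
with torch.no_grad():
    tgt = target_encoder(tokens)[:, mask]

# 3) Predict target features from the context. (The real predictor is a
#    transformer conditioned on the masked positions' coordinates; mean
#    pooling here just keeps the sketch short.)
pred = predictor(ctx.mean(dim=1, keepdim=True).expand(-1, n_masked, -1))

# 4) Regress features directly; V-JEPA uses an L1 loss in feature space.
loss = F.l1_loss(pred, tgt)
loss.backward()

# 5) EMA update of the target encoder; this asymmetry prevents collapse.
momentum = 0.999
with torch.no_grad():
    for p_t, p_c in zip(target_encoder.parameters(),
                        context_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)
```

Note the two design choices the prose emphasizes: the mask is sampled once per clip and broadcast over time (step 2 would be trivial with per-frame masks, since unmasked neighbors leak the answer), and the loss compares features to features, never reconstructed pixels.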