V-JEPA (Video Joint-Embedding Predictive Architecture) is Meta AI's self-supervised video model that learns visual representations by predicting masked spatiotemporal regions in feature space rather than pixel space. Building on I-JEPA for images, V-JEPA uses three components: a context encoder, a predictor, and a target encoder whose weights are an exponential moving average of the context encoder's, which prevents representational collapse. Masked targets cover the same spatial region across every frame of the clip (a spatiotemporal "tube"), so the model cannot simply copy features from neighboring frames and must instead learn richer temporal semantics. V-JEPA outperforms comparable models on motion-centric benchmarks (Something-Something-v2) and is competitive on appearance-centric ones (Kinetics-400), demonstrating the effectiveness of feature prediction over pixel prediction for video representation learning.
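To make the training loop concrete, here is a minimal PyTorch sketch of one V-JEPA-style update. The linear layers standing in for the ViT encoders and predictor, the token shapes, and the `tube_mask` helper are all hypothetical simplifications, not Meta's implementation; only the overall structure (context and target encoders, tube masking, L1 regression in feature space, EMA target update) follows the description above.

```python
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

# Toy dimensions: T frames, S spatial patches per frame, D-dim tokens.
T, S, D = 8, 196, 768

# Stand-ins for the real video ViT backbones and the narrow transformer
# predictor used in the paper (hypothetical modules for illustration).
context_encoder = nn.Linear(D, D)
predictor = nn.Linear(D, D)
target_encoder = copy.deepcopy(context_encoder)
for p in target_encoder.parameters():
    p.requires_grad_(False)  # target weights are EMA-updated, never trained

def tube_mask(spatial_ratio: float = 0.5) -> torch.Tensor:
    """Mask the SAME spatial patches in every frame (a spatiotemporal tube)."""
    spatial = torch.rand(S) < spatial_ratio               # (S,) shared across time
    return spatial.unsqueeze(0).expand(T, S).reshape(-1)  # (T*S,)

tokens = torch.randn(1, T * S, D)  # one pre-tokenized video clip (hypothetical)
mask = tube_mask()                 # True = masked, to-be-predicted position
n_masked = int(mask.sum())

# 1) Encode only the visible (context) tokens.
ctx = context_encoder(tokens[:, ~mask])

# 2) Targets are the EMA encoder's features for the masked tokens,
#    computed over the full clip, with a stop-gradient.
with torch.no_grad():
    tgt = target_encoder(tokens)[:, mask]

# 3) Predict target features from the context. (The real predictor is a
#    transformer conditioned on the masked positions' coordinates; mean
#    pooling here just keeps the sketch short.)
pred = predictor(ctx.mean(dim=1, keepdim=True).expand(-1, n_masked, -1))

# 4) Regress features directly; V-JEPA uses an L1 loss in feature space.
loss = F.l1_loss(pred, tgt)
loss.backward()

# 5) EMA update of the target encoder; this asymmetry prevents collapse.
momentum = 0.999
with torch.no_grad():
    for p_t, p_c in zip(target_encoder.parameters(),
                        context_encoder.parameters()):
        p_t.mul_(momentum).add_(p_c, alpha=1.0 - momentum)
```

Note the two design choices the prose emphasizes: the mask is sampled once per clip and broadcast over time (step 2 would be trivial with per-frame masks, since unmasked neighbors leak the answer), and the loss compares features to features, never reconstructed pixels.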