In this video, we dive into Perception Language Models (PLMs), introduced in a recent paper from Meta titled PerceptionLM: Open-Access Data and Models for Detailed Visual Understanding.
While most vision-language models (VLMs) today are either closed or trained via distillation from black-box models, PLMs are fully open-source and trained without distilling from proprietary models.
They perform strongly on image and video benchmarks that require detailed visual understanding, setting new state-of-the-art results on several of them.

🔗 Written Review - soon :)
🔗 Paper: https://arxiv.org/abs/2504.13180
🔗 Models & Code: https://github.com/facebookresearch/perception_models
___________________
Chapters:
0:00 Introduction
1:25 PLM Architecture
3:40 PLM Training & Data
7:30 Results

AI Papers Academy

Meta's Perception Language Models (PLMs) are fully open-source vision-language models built without relying on proprietary models for training data. The architecture combines Llama 3 as the base LLM with a Perception Encoder (PE) and a two-layer MLP projector to handle text, image, and video inputs. Training proceeds in three stages: a warm-up stage that trains only the projector, a mid-training stage that jointly trains all components on 64.7M samples, and a final supervised fine-tuning stage on 14M human-annotated samples, including challenging video tasks with event timestamps. Dynamic tiling enables high-resolution image processing. PLMs achieve state-of-the-art results on image captioning and hard perception benchmarks, and competitive results on video benchmarks, all without distilling from proprietary models.
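
The three-stage schedule described above can be sketched as a small freeze plan that records which components are updated in each stage. This is an illustrative sketch only, not code from the PLM repository: the names `Stage`, `STAGES`, and `freeze_plan` are hypothetical, the warm-up sample count is left unset because the summary gives none, and marking every component trainable during fine-tuning is an assumption, since the summary does not say which parts are updated in that stage.

```python
from dataclasses import dataclass
from typing import FrozenSet, Optional

# The three parts named in the summary: Perception Encoder, MLP projector, Llama 3.
COMPONENTS = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class Stage:
    """One training stage (hypothetical structure for illustration)."""
    name: str
    trainable: FrozenSet[str]
    num_samples: Optional[int]  # None where the summary gives no count


STAGES = (
    # Stage 1: warm-up, only the projector is trained.
    Stage("warmup", frozenset({"projector"}), None),
    # Stage 2: mid-training, all components trained jointly on 64.7M samples.
    Stage("mid_training", frozenset(COMPONENTS), 64_700_000),
    # Stage 3: supervised fine-tuning on 14M human-annotated samples.
    # Assumption: all components stay trainable; the summary does not specify.
    Stage("sft", frozenset(COMPONENTS), 14_000_000),
)


def freeze_plan(stage: Stage) -> dict:
    """Map each component to True if it is updated during this stage."""
    return {name: name in stage.trainable for name in COMPONENTS}


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, freeze_plan(stage), stage.num_samples)
```

In a real training loop, a plan like this would typically drive which parameter groups receive gradient updates at each stage.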

Perception Language Models (PLMs) by Meta – A Fully Open SOTA VLM