Meta's Perception Language Models (PLMs) are fully open-source vision-language models built without relying on proprietary models for training data. The architecture combines Llama 3 as the base LLM with a Perception Encoder (PE) and a two-layer MLP projector to handle text, image, and video inputs. Training proceeds in three stages: a warm-up phase that trains only the projector, a mid-training phase that jointly trains all components on 64.7M samples, and a final supervised fine-tuning stage on 14M human-annotated samples, including challenging video tasks with event timestamps. Dynamic tiling enables high-resolution image processing. PLMs achieve state-of-the-art results on image captioning and hard perception benchmarks, and competitive results on video benchmarks, all without distilling from proprietary models.
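The three-stage schedule can be pictured as a small table of which components receive gradient updates at each stage. The following is a minimal sketch, not the released training code: the names `Stage`, `STAGES`, `COMPONENTS`, and `freeze_plan` are hypothetical, the warm-up sample count is left unset because the summary gives no figure, and the set of components updated during supervised fine-tuning is an assumption, since the summary does not specify it.

```python
from dataclasses import dataclass
from typing import Optional

# Hypothetical component names for the three parts named in the summary:
# the Perception Encoder, the two-layer MLP projector, and the Llama 3 LLM.
COMPONENTS = ("vision_encoder", "projector", "llm")


@dataclass(frozen=True)
class Stage:
    name: str
    trainable: frozenset          # components updated in this stage
    num_samples: Optional[int]    # None where the summary gives no figure


STAGES = (
    # Stage 1: warm-up, only the projector is trained.
    Stage("warmup", frozenset({"projector"}), None),
    # Stage 2: mid-training, all components trained jointly on 64.7M samples.
    Stage("mid_training", frozenset(COMPONENTS), 64_700_000),
    # Stage 3: supervised fine-tuning on 14M human-annotated samples.
    # Assumption: all components stay trainable; the summary does not say.
    Stage("sft", frozenset(COMPONENTS), 14_000_000),
)


def freeze_plan(stage: Stage) -> dict:
    """Map each component to True if it is updated in this stage."""
    return {name: name in stage.trainable for name in COMPONENTS}


if __name__ == "__main__":
    for stage in STAGES:
        print(stage.name, freeze_plan(stage))
```

In a framework such as PyTorch, a plan like this would typically be applied by setting `requires_grad` on each component's parameters before building the optimizer for that stage.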