Vision Language Models (Better, Faster, Stronger)
This post reviews the developments in vision language models over the past year, highlighting new model architectures, specialized capabilities, and emerging paradigms. It covers trends such as any-to-any models, reasoning models, smaller yet capable models, and multimodal safety models, offering insights into how these innovations are shaping the future of AI.
Table of contents
- Motivation
- Table of Contents
- New model trends
  - Any-to-any models
  - Reasoning Models
  - Smol yet Capable Models
  - Mixture-of-Experts as Decoders
  - Vision-Language-Action Models
- Specialized Capabilities
  - Object Detection, Segmentation, Counting with Vision Language Models
  - Multimodal Safety Models
  - Multimodal RAG: retrievers, rerankers
- Multimodal Agents
- Video Language Models
- New Alignment Techniques for Vision Language Models
- New benchmarks
  - MMT-Bench
  - MMMU-Pro
- Useful Resources