In this video we explain how the TinyGPT-V model was built by reviewing the research paper that introduced it: "TinyGPT-V: Efficient Multimodal Large Language Model via Small Backbones".

TinyGPT-V is a small multimodal large language model (MLLM) that uses Phi-2 as its backbone LLM. As a result, TinyGPT-V has only 2.8B parameters, making it smaller than MLLMs built on larger LLMs, yet it achieves results comparable to those of much larger MLLMs on various vision-language tasks.

Paper page - https://arxiv.org/abs/2312.16862
Project page - https://github.com/DLYuanGod/TinyGPT-V
Post - https://aipapersacademy.com/tinygpt-v/

-----------------------------------------------------------------------------------------------
✉️ Join the newsletter - https://aipapersacademy.com/newsletter/

👍 Please like & subscribe if you enjoy this content

We use VideoScribe to edit our videos - https://tidd.ly/44TZEiX (affiliate)
-----------------------------------------------------------------------------------------------

Chapters:
0:00 TinyGPT-V Motivation
1:17 Model Architecture
3:40 Training Process
4:45 Results

AI Papers Academy

TinyGPT-V is a compact multimodal large language model that combines a pretrained EVA vision transformer, a Q-Former from BLIP-2, projection layers, and the 2.7B-parameter Phi-2 LLM backbone to process both images and text. By keeping most components frozen and training only projection layers, normalization layers, and LoRA weights, the model stays small and efficient. Despite having only 2.8B parameters, TinyGPT-V achieves performance comparable to much larger models (9B–13B) across multiple vision-language benchmarks. Training proceeds in four stages: warm-up, pre-training with LoRA, instruction learning, and multi-task learning.
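
To give a feel for why training only LoRA weights keeps the trainable part small, here is a minimal sketch of the standard LoRA parameter arithmetic. The layer width and rank below are hypothetical values chosen for illustration, not settings taken from the paper or the TinyGPT-V repository.

```python
def full_param_count(d_out: int, d_in: int) -> int:
    """Number of weights in a dense d_out x d_in matrix (kept frozen under LoRA)."""
    return d_out * d_in


def lora_param_count(d_out: int, d_in: int, rank: int) -> int:
    """Trainable weights added by a LoRA adapter on one d_out x d_in matrix.

    LoRA learns two low-rank factors, B (d_out x rank) and A (rank x d_in),
    whose product is added to the frozen weight.
    """
    return rank * (d_out + d_in)


if __name__ == "__main__":
    # Hypothetical sizes, assumed only for this illustration.
    width = 2560
    rank = 64

    frozen = full_param_count(width, width)
    trainable = lora_param_count(width, width, rank)

    print(f"frozen weights in one matrix:    {frozen:,}")
    print(f"trainable LoRA weights:          {trainable:,}")
    print(f"trainable share of that matrix:  {trainable / frozen:.1%}")
```

The same reasoning applies per adapted matrix: the lower the rank relative to the layer width, the smaller the fraction of weights that actually receive gradient updates.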

TinyGPT-V: Small but Mighty Multimodal Large Language Model