TinyGPT-V is a compact multimodal large language model that combines a pretrained EVA vision transformer, the Q-Former from BLIP-2, projection layers, and the 2.7B-parameter Phi-2 LLM backbone to process both images and text. Most components stay frozen; only the projection layers, normalization layers, and LoRA weights are trained, which keeps training inexpensive. With only 2.8B parameters in total, TinyGPT-V achieves performance comparable to much larger models (9B–13B) across multiple vision-language benchmarks. Training proceeds in four stages: warm-up, pre-training with LoRA, instruction learning, and multi-task learning.
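
To make the frozen-versus-trained split concrete, here is a minimal sketch of the bookkeeping involved. It is not the TinyGPT-V implementation: the `Component` class, the component names, and every size below are hypothetical toy values chosen only so the arithmetic is easy to check. The one general fact it relies on is that a rank-r LoRA adapter on a weight of shape d_out x d_in adds r * (d_in + d_out) trainable parameters.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    name: str
    params: int      # toy parameter count, NOT a real TinyGPT-V figure
    trainable: bool  # False means the weights stay frozen


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters added by a LoRA adapter on a d_out x d_in weight.

    LoRA learns two low-rank factors, A (rank x d_in) and B (d_out x rank),
    while the original weight stays frozen.
    """
    return rank * (d_in + d_out)


def trainable_fraction(components: list[Component]) -> float:
    """Share of all parameters that receive gradient updates."""
    total = sum(c.params for c in components)
    trained = sum(c.params for c in components if c.trainable)
    return trained / total


# Hypothetical toy sizes, picked only to illustrate the split described above.
toy_model = [
    Component("vision_encoder", 1_000, trainable=False),
    Component("q_former", 100, trainable=False),
    Component("projection", 10, trainable=True),
    Component("llm_backbone", 2_700, trainable=False),
    Component("llm_norm_layers", 5, trainable=True),
    Component("llm_lora", lora_params(d_in=16, d_out=16, rank=2), trainable=True),
]

if __name__ == "__main__":
    print(f"trainable share: {trainable_fraction(toy_model):.2%}")
```

Even in this toy setup the trained parameters are a small share of the total, which is the reason freezing the vision encoder, Q-Former, and LLM backbone keeps each training stage cheap.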
