Towards Data Science

A practical deep dive into how text-only language models are extended to understand images, covering the three core components: a frozen Vision Transformer (ViT) image backbone, a Q-Former adapter layer trained with contrastive, matching, and generation losses to bridge image and text embeddings, and a small LLM (SmolLM2-135M) fine-tuned with LoRA adapters. The author walks through the full training pipeline with open-source code, explaining architectural choices such as frozen backbones, cross-attention layers, and learnable query embeddings, and showing how image tokens are interleaved with text tokens for autoregressive generation.
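To make the data flow concrete, here is a minimal standard-library sketch of the idea that a small set of learnable queries cross-attends to image patch features, and that the resulting image tokens are joined with text token embeddings before reaching the LLM. It is an illustration under stated assumptions, not the author's implementation: the sizes (`DIM`, `NUM_PATCHES`, `NUM_QUERIES`, `NUM_TEXT`) are hypothetical, random vectors stand in for the frozen ViT outputs and the learned queries, the attention has a single head with no learned projections, and the image tokens are simply placed before the text tokens.

```python
import math
import random


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def cross_attend(queries, features):
    """Simplified cross-attention (assumption: one head, no learned
    projections). Each query becomes a softmax-weighted average of
    the image features."""
    dim = len(features[0])
    out = []
    for q in queries:
        weights = softmax([dot(q, f) / math.sqrt(dim) for f in features])
        out.append(
            [sum(w * f[i] for w, f in zip(weights, features)) for i in range(dim)]
        )
    return out


def random_vectors(rng, count, dim):
    return [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(count)]


rng = random.Random(0)
DIM = 8            # hypothetical shared embedding width
NUM_PATCHES = 16   # stand-in for frozen ViT patch outputs
NUM_QUERIES = 4    # hypothetical number of learnable queries
NUM_TEXT = 5       # stand-in for the prompt's text token embeddings

patch_features = random_vectors(rng, NUM_PATCHES, DIM)  # frozen: never updated
queries = random_vectors(rng, NUM_QUERIES, DIM)         # would be trained
text_embeddings = random_vectors(rng, NUM_TEXT, DIM)

# The adapter compresses 16 patch vectors into 4 image tokens.
image_tokens = cross_attend(queries, patch_features)

# The LLM then sees one sequence: image tokens followed by text tokens.
llm_input = image_tokens + text_embeddings

print(len(image_tokens), len(llm_input))  # 4 9
```

The point of the sketch is the shape bookkeeping: the number of image tokens is set by the number of queries rather than by the number of patches, so the LLM's input length stays fixed regardless of how many patches the image backbone produces.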

How Vision Language Models Are Trained from “Scratch”