CLIP (Contrastive Language-Image Pre-training) is OpenAI's neural network that learns to recognize images by matching them with text descriptions, trained on 400 million image-text pairs collected from the internet. Unlike traditional computer vision models, which require an expensive labeled dataset for each task, CLIP achieves zero-shot classification by comparing an image's embedding with the embeddings of candidate text descriptions in a shared vector space. The model uses two encoders (one for images, one for text) trained with contrastive learning, which makes training 4-10x more efficient than caption generation approaches. CLIP can classify images into any categories described in natural language without retraining, though it struggles with fine-grained distinctions and spatial reasoning, and it inherits biases from internet data. It has become foundational infrastructure for modern AI systems, including Stable Diffusion and DALL-E.
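To make the zero-shot step concrete, here is a minimal sketch of classification by embedding comparison. It does not call CLIP itself: the 3-dimensional vectors stand in for the outputs of the image and text encoders, and the names `zero_shot_classify` and `logit_scale` (with its value of 100.0) are assumptions chosen for illustration, not taken from the article.

```python
import math


def normalize(v):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def cosine_similarity(a, b):
    """Cosine of the angle between two vectors."""
    a, b = normalize(a), normalize(b)
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Turn raw scores into probabilities (shifted by the max for numerical stability)."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def zero_shot_classify(image_embedding, text_embeddings, logit_scale=100.0):
    """Score one image embedding against every candidate text embedding.

    logit_scale is a hypothetical temperature-like factor that sharpens
    the softmax; the value here is an assumption for this sketch.
    """
    labels = list(text_embeddings)
    logits = [
        logit_scale * cosine_similarity(image_embedding, text_embeddings[label])
        for label in labels
    ]
    return dict(zip(labels, softmax(logits)))


# Toy stand-ins for encoder outputs; real embeddings have hundreds of dimensions.
text_embeddings = {
    "a photo of a dog": [0.9, 0.1, 0.0],
    "a photo of a cat": [0.1, 0.9, 0.0],
    "a photo of a car": [0.0, 0.1, 0.9],
}
image_embedding = [0.8, 0.2, 0.1]

probs = zero_shot_classify(image_embedding, text_embeddings)
best_label = max(probs, key=probs.get)
print(best_label)
```

Changing the set of categories only means changing the text prompts and their embeddings; nothing is retrained, which is the property the summary describes.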
Table of contents
The Problem CLIP Solves
The Technical Foundation
Zero-Shot Classification in Action
Design Choices That Made CLIP Possible
Conclusion