CLIP is a multimodal AI model developed by OpenAI that uses contrastive learning to create meaningful embeddings for both images and text. Trained on 400 million image-text pairs, it employs separate encoders for text (Transformer-based) and images (ResNet or Vision Transformer) to learn a shared embedding space in which matched image-text pairs lie close together and mismatched pairs lie far apart.
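To make the idea concrete, here is a minimal numpy sketch of the symmetric contrastive objective CLIP trains with: cosine similarities between every image and text embedding in a batch are scaled by a temperature, and cross-entropy pushes each matched pair toward the diagonal. The function name and temperature value are illustrative, not CLIP's actual implementation.

```python
import numpy as np

def clip_contrastive_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss over a batch of paired embeddings.

    image_emb, text_emb: (N, D) arrays of L2-normalized embeddings;
    row i of each array is a matched image-text pair.
    """
    # Pairwise cosine similarities, scaled by temperature: (N, N)
    logits = image_emb @ text_emb.T / temperature
    n = logits.shape[0]
    idx = np.arange(n)  # matched pairs sit on the diagonal

    def cross_entropy(l):
        # Numerically stable log-softmax over each row
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        # Negative log-likelihood of the correct (diagonal) match
        return -log_probs[idx, idx].mean()

    # Average the image-to-text and text-to-image directions
    return (cross_entropy(logits) + cross_entropy(logits.T)) / 2
```

When the two encoders map matched pairs to nearly identical vectors, the diagonal dominates each row and the loss approaches zero; random embeddings give a loss near log N.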
Table of contents

Introduction
Contrastive learning
Architecture & training details
Advantages
Applications
Conclusion
Resources