CLIP is a multimodal AI model developed by OpenAI that uses contrastive learning to create meaningful embeddings for both images and text. Trained on 400 million image-text pairs, it employs separate encoders for text (Transformer-based) and images (ResNet or Vision Transformer) to learn a shared embedding space in which matching image-text pairs lie close together and mismatched pairs lie far apart.
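The contrastive objective described above can be sketched as a symmetric cross-entropy over a batch's image-text similarity matrix, where each image's matching caption sits on the diagonal. This is an illustrative NumPy sketch, not OpenAI's implementation; the function name, the toy embeddings, and the temperature value are assumptions for demonstration.

```python
import numpy as np

def clip_contrastive_loss(img_emb, txt_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss for a batch of
    matched image-text pairs: row i of img_emb pairs with row i of txt_emb."""
    # L2-normalize so dot products become cosine similarities
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)
    txt = txt_emb / np.linalg.norm(txt_emb, axis=1, keepdims=True)
    logits = img @ txt.T / temperature          # (N, N) similarity matrix
    labels = np.arange(len(logits))             # true pairs lie on the diagonal

    def xent(l):
        # cross-entropy of each row against its diagonal target
        l = l - l.max(axis=1, keepdims=True)    # numerical stability
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[labels, labels].mean()

    # average the image->text and text->image directions
    return (xent(logits) + xent(logits.T)) / 2

# toy batch: 4 "images" and slightly perturbed "captions"
rng = np.random.default_rng(0)
img = rng.normal(size=(4, 8))
txt = img + 0.01 * rng.normal(size=(4, 8))
loss = clip_contrastive_loss(img, txt)
```

Minimizing this loss pulls each image embedding toward its own caption's embedding while pushing it away from the other captions in the batch, which is what produces the shared space described above.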

7 min read · From towardsdatascience.com
Table of contents
- Introduction
- Contrastive learning
- Architecture & Training
- Details
- Advantages
- Applications
- Conclusion
- Resources