This post discusses the use of visual LLMs, specifically LLaVA-1.6, in large-scale multi-label image classification pipelines. It covers the pros and cons of using CLIP models for zero-shot image classification and compares LLaVA's performance to the ViT-B-32 CLIP model on an age classification task. It also explores how to fit the classification step into a pipeline and how to run batch inference on LLaVA.
Table of contents
- A little context on CLIP
- Using LLaVA-34B for multi-label classification
- Now how do we fit the classification into a pipeline?
- Batch Inference on LLaVA
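
Before getting into the details, the core mechanic behind CLIP-style zero-shot classification — embed the image and each candidate label text, then pick the label with the highest cosine similarity — can be sketched as follows. This is a toy illustration, not the post's implementation: the `zero_shot_classify` helper and the random embeddings stand in for real CLIP image/text encoder outputs.

```python
import numpy as np

def zero_shot_classify(image_emb, label_embs, labels):
    """Return the label whose text embedding has the highest cosine
    similarity with the image embedding (the core of CLIP zero-shot)."""
    img = image_emb / np.linalg.norm(image_emb)
    txt = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    sims = txt @ img  # one cosine similarity per candidate label
    return labels[int(np.argmax(sims))]

# Toy stand-ins for real encoder outputs (e.g. 512-d ViT-B-32 embeddings).
labels = ["0-18", "18-40", "40+"]
rng = np.random.default_rng(0)
label_embs = rng.normal(size=(3, 512))
# Simulate an image embedding that lies close to the "18-40" text embedding.
image_emb = label_embs[1] + 0.1 * rng.normal(size=512)

best = zero_shot_classify(image_emb, label_embs, labels)
print(best)
```

In a real pipeline the embeddings would come from a CLIP checkpoint's image and text encoders, and the similarities would typically be softmaxed into per-label probabilities rather than argmaxed directly.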