This post discusses the use of visual LLMs, specifically LLaVA-1.6, in large-scale multi-label image classification pipelines. It covers the pros and cons of using CLIP models for zero-shot image classification and compares LLaVA's performance to the ViT-B-32 CLIP model on an age classification task. It also explores how to fit the classification step into a pipeline and how to run batch inference with LLaVA.
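For context, CLIP-style zero-shot classification scores an image against a set of text prompts: both are embedded, normalised, and compared by scaled cosine similarity, with a softmax over the candidate labels. The sketch below illustrates only that scoring step with random NumPy arrays standing in for the encoder outputs (the 512-dim size and the ~100 logit scale match ViT-B-32 CLIP, but the embeddings here are hypothetical, not real model outputs):

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax over the label axis.
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
# Stand-ins for CLIP encoder outputs (ViT-B-32 produces 512-dim embeddings):
image_emb = rng.standard_normal((2, 512))   # 2 images
text_emb = rng.standard_normal((4, 512))    # 4 age-bucket prompts

# L2-normalise so the dot product is a cosine similarity.
image_emb /= np.linalg.norm(image_emb, axis=-1, keepdims=True)
text_emb /= np.linalg.norm(text_emb, axis=-1, keepdims=True)

# Scale similarities by CLIP's learned temperature (~100 after exp), then softmax.
logits = 100.0 * image_emb @ text_emb.T     # shape (2, 4)
probs = softmax(logits)                     # per-image distribution over labels
preds = probs.argmax(axis=-1)               # predicted label index per image
```

In a real pipeline the embeddings would come from the CLIP image and text encoders; the multi-label variant the post discusses would threshold per-label scores instead of taking a single argmax.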

4 min read · From medium.com
Table of contents
- A little context on CLIP
- Using LLaVA-34B for multi-label classification
- Now how do we fit the classification into a pipeline?
- Batch Inference on LLaVA
