This post discusses the use of visual LLMs, specifically LLaVA-1.6, in large-scale multi-label image classification pipelines. It covers the pros and cons of using CLIP models for zero-shot image classification and compares LLaVA's performance with that of the ViT-B-32 CLIP model on an age classification task. It also explores how to fit the classification step into a pipeline and introduces batch inference on LLaVA to increase throughput. The results suggest that LLaVA is a viable option for multi-label classification pipelines and performs well on the age classification task.
Table of contents
A little context on CLIP
Using LLaVA-34B for multi-label classification
Now how do we fit the classification into a pipeline?
Batch Inference on LLaVA