This post examines the use of visual LLMs, specifically LLaVA-1.6, in large-scale multi-label image classification pipelines. It weighs the pros and cons of CLIP models for zero-shot image classification and compares LLaVA's performance with that of the ViT-B-32 CLIP model on an age classification task. It then explains how to fit the classification step into a pipeline and introduces batch inference on LLaVA to increase throughput. The results show that LLaVA is a viable option for multi-label classification pipelines and performs well on the age classification task.

4m read time. From medium.com
Table of contents
A little context on CLIP
Using LLaVA-34B for multi-label classification
Now how do we fit the classification into a pipeline?
Batch Inference on LLaVA
