In this post I cover the pros and cons of using a visual large language model, specifically LLaVA-1.6, for offline, batch, zero-shot multi-label image classification.


Lightricks Tech Blog

This post discusses the use of Visual LLMs, specifically LLaVA-1.6, in large-scale multi-label image classification pipelines. It covers the pros and cons of using CLIP models for zero-shot image classification and compares LLaVA's performance to the ViT-B-32 CLIP model on an age classification task. It also explores how to fit the classification into a pipeline and introduces batch inference on LLaVA to increase throughput. The results show that LLaVA is a viable option for multi-label classification pipelines and performs well on age classification tasks.

Using Visual LLMs in large-scale multi-label image classification pipelines

Using LLaVA-34B for multi-label classification
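One way to get multi-label output from a generative model like LLaVA is to ask a closed-vocabulary question and then map the free-text answer back onto the known labels. The sketch below illustrates only that prompt-and-parse step, not the model call itself. The label set, the prompt wording, and the helper names `build_prompt` and `parse_labels` are all hypothetical and not taken from the original pipeline; the prompt would still need to be wrapped in whatever chat template the chosen LLaVA checkpoint expects.

```python
import re

# Hypothetical label set, used only for illustration.
LABELS = ["person", "animal", "outdoor", "text"]


def build_prompt(labels):
    """Build a closed-vocabulary question that asks for a comma-separated answer."""
    options = ", ".join(labels)
    return (
        "Which of the following labels apply to the image? "
        f"Options: {options}. "
        "Answer with a comma-separated list of the labels that apply, or 'none'."
    )


def parse_labels(response, labels):
    """Map a free-text model answer to a {label: bool} dict over the known labels."""
    # Split on commas, semicolons, and newlines, then normalize each piece.
    pieces = re.split(r"[,;\n]", response)
    tokens = {piece.strip().lower().rstrip(".") for piece in pieces}
    # Anything the model says outside the known vocabulary is ignored.
    return {label: label.lower() in tokens for label in labels}
```

For example, `parse_labels("Person, outdoor.", LABELS)` marks `person` and `outdoor` as true and the other labels as false. Exact-match parsing is deliberately strict: it drops answers that stray from the vocabulary instead of guessing, which is usually the safer default in an offline batch job.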

Now how do we fit the classification into a pipeline?