LLMDet is an open-vocabulary object detection model that leverages a large language model to improve detection accuracy, generalization, and rare-class recognition. It is trained on a new dataset that pairs images with both detailed and concise captions to strengthen vision-language alignment. Its two-stage training combines grounding and captioning losses over image-level and region-level annotations. By addressing limitations of prior open-vocabulary detectors, LLMDet achieves state-of-the-art results across multiple benchmarks and demonstrates the value of tighter multi-modal learning integration.
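The combination of grounding and captioning objectives can be sketched as a simple weighted multi-task loss. This is an illustrative toy sketch only: the function name `total_loss` and the weights `lambda_img` and `lambda_region` are assumptions for exposition, not values or APIs from the LLMDet paper.

```python
# Toy sketch of a multi-task objective that combines a detection/grounding
# loss with image-level and region-level captioning losses, mirroring the
# training setup described above. The weights are illustrative assumptions.

def total_loss(grounding_loss: float,
               image_caption_loss: float,
               region_caption_loss: float,
               lambda_img: float = 1.0,
               lambda_region: float = 1.0) -> float:
    """Weighted sum of the grounding and captioning objectives."""
    return (grounding_loss
            + lambda_img * image_caption_loss
            + lambda_region * region_caption_loss)

# Example: equal weighting of all three terms.
print(total_loss(0.5, 0.25, 0.25))  # → 1.0
```

In practice each term would be computed by the detector and captioning heads on a shared batch; the scalar weights balance how strongly the caption supervision shapes the visual features.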