Researchers have introduced VCoder, a method that enhances the perceptual capabilities of Multimodal Large Language Models (MLLMs) by feeding additional perception modalities, such as segmentation and depth maps, into the models as extra inputs. VCoder improves object-perception tasks, such as accurately identifying and counting objects in a visual scene, without degrading the models' reasoning abilities. Experiments show that VCoder boosts performance on information that is underrepresented in the training data, improving the models' robustness and factuality.
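At a high level, the idea of adding perception modalities can be pictured as injecting token embeddings from auxiliary perception encoders alongside the usual image tokens before they reach the language model. The sketch below is a minimal illustration of that fusion step, not the authors' implementation; the function name and token shapes are hypothetical.

```python
import numpy as np

def fuse_perception_tokens(image_tokens, seg_tokens, depth_tokens):
    """Hypothetical fusion step: concatenate token embeddings from
    auxiliary perception encoders (e.g. segmentation and depth maps)
    with the regular image tokens along the token dimension, so the
    language model sees one combined visual sequence."""
    return np.concatenate([seg_tokens, depth_tokens, image_tokens], axis=0)

# Toy embeddings with shape (num_tokens, hidden_dim)
image_tokens = np.zeros((256, 768))
seg_tokens = np.zeros((64, 768))
depth_tokens = np.zeros((64, 768))

fused = fuse_perception_tokens(image_tokens, seg_tokens, depth_tokens)
print(fused.shape)  # (384, 768)
```

In this toy setup the combined sequence simply grows by the number of extra perception tokens; the key point is that the language model receives scene-level cues (object masks, depth) it could not easily recover from raw image tokens alone.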

From marktechpost.com