Researchers have developed a cost-effective reward mechanism for video language models (VLMs) by utilizing detailed video captions. They have also created a comprehensive video caption dataset to address the shortage of high-quality video captions. With this reward mechanism, they achieved an 8.1% accuracy improvement on video question-answering tasks.
Sort: