Large Language Models (LLMs) such as GPT, PaLM, and LLaMA have demonstrated impressive natural-language understanding and generation. However, extending their capabilities to video, with its rich temporal structure, remains a challenge. To address this, a team of researchers has proposed ST-LLM, an approach that leverages the sequence-modeling capabilities of LLMs to process raw spatial-temporal video tokens directly. ST-LLM exhibits strong temporal understanding and achieves state-of-the-art performance on video benchmarks.
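The core idea, feeding spatial-temporal video tokens into the LLM's input sequence alongside text tokens, can be sketched roughly as follows. This is a minimal illustration, not ST-LLM's actual implementation: the shapes, the random stand-ins for the visual encoder and text embeddings, and all variable names are assumptions for the example.

```python
import numpy as np

# Hypothetical shapes: a clip of T frames, each encoded by a visual
# encoder into P patch embeddings of dimension D (illustrative values).
T, P, D = 8, 16, 64

# Stand-in for per-frame visual features from an encoder (e.g. a ViT).
frame_features = np.random.rand(T, P, D)

# Flatten the spatial-temporal tokens into one sequence of shape (T*P, D),
# preserving frame order so temporal structure is encoded in token position.
video_tokens = frame_features.reshape(T * P, D)

# Stand-in for the embedded text prompt tokens.
text_tokens = np.random.rand(5, D)

# The LLM consumes the joint sequence directly, so its self-attention can
# model relationships across frames and between video and text.
llm_input = np.concatenate([video_tokens, text_tokens], axis=0)
print(llm_input.shape)  # (133, 64)
```

The point of the sketch is that no separate temporal-modeling module is required: once video tokens are laid out as a sequence, the LLM's own attention mechanism does the temporal reasoning.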