Large language models (LLMs) like JPEG-LM and AVC-LM can generate images and videos by outputting compressed file bytes in JPEG and H.264 formats. This approach outperforms specialized vision models on various benchmarks and highlights the potential for unified multimodal AI systems capable of handling text, images, and video through a common architecture. While this method shows promise in generating diverse visual elements, questions remain about its scalability, flexibility, and applicability to tasks like image classification and visual understanding.
Sort: