Large language models (LLMs) such as JPEG-LM and AVC-LM can generate images and videos by directly outputting the bytes of compressed files in JPEG and H.264 (AVC) formats. This approach reportedly outperforms specialized vision models on several benchmarks and highlights the potential for unified multimodal AI systems that handle text, images, and video.
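To make "outputting compressed file bytes" concrete, here is a minimal sketch of treating a file's bytes as a token sequence and turning generated tokens back into a file. The vocabulary (ids 0-255 for raw bytes, plus hypothetical `BOS` and `EOS` ids) and both function names are assumptions for illustration; the tokenizers JPEG-LM and AVC-LM actually use may differ.

```python
# Hypothetical byte-level vocabulary: ids 0-255 are raw byte values,
# and two extra ids mark the start and end of a file. This is an
# illustrative assumption, not the tokenizer used by JPEG-LM or AVC-LM.
BOS, EOS = 256, 257


def bytes_to_tokens(data: bytes) -> list[int]:
    """Wrap a file's raw bytes in start/end markers as a token sequence."""
    return [BOS, *data, EOS]


def tokens_to_bytes(tokens: list[int]) -> bytes:
    """Turn a token sequence back into file bytes, stopping at EOS."""
    out = bytearray()
    for token in tokens:
        if token == EOS:
            break
        if token == BOS:
            continue
        if not 0 <= token <= 255:
            raise ValueError(f"token {token} is not a byte value")
        out.append(token)
    return bytes(out)


# A JPEG stream begins with the SOI marker (FF D8) and ends with the
# EOI marker (FF D9). This 4-byte stub is not a viewable image; it only
# shows that the byte-to-token mapping is lossless.
stub = bytes([0xFF, 0xD8, 0xFF, 0xD9])
tokens = bytes_to_tokens(stub)
restored = tokens_to_bytes(tokens)
```

Under this framing, a model trained on such sequences produces a file one token at a time, and whatever it emits before `EOS` can be written to disk and opened by a standard decoder, provided the bytes form a valid stream.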
•7m read time• From notes.aimodels.fyi