Large language models (LLMs) such as JPEG-LM and AVC-LM can generate images and videos by directly outputting the bytes of compressed files: JPEG for images and H.264 (AVC) for video. This approach outperforms specialized vision models on various benchmarks and highlights the potential for unified multimodal AI systems capable of handling text, images, and video.

7m read time From notes.aimodels.fyi
Table of contents
Overview, Plain English Explanation, Technical Explanation, Critical Analysis, Conclusion
