Best of Multimodal: August 2024

  1. Article
    AIModels.fyi · 2y

    LLMs can speak in JPEG

    Large language models (LLMs) such as JPEG-LM and AVC-LM can generate images and videos by directly outputting the bytes of compressed files, in JPEG format for images and H.264 (AVC) format for video. According to the article, this approach outperforms specialized vision models on various benchmarks, and it points toward unified multimodal AI systems that handle text, images, and video with a single common architecture. The method shows promise in generating diverse visual elements, but questions remain about its scalability, its flexibility, and its applicability to tasks such as image classification and visual understanding.

  2. Video
    Community Picks · 2y

    Gemini AI MultiModal Model Course

    This course teaches how to build an app using the Gemini multimodal AI model developed by Google. The app can analyze uploaded images and provide text-based responses to questions about those images. The course covers an introduction to Gemini, setting up the development environment, authentication, and building the app using Node.js and React. Google provided funding through a grant to make the course possible.