Language models trained purely on text have a solid understanding of the visual world. They can generate complex scenes and refine their images. Researchers from MIT have tested the visual knowledge of these models and trained a computer vision system without using any visual data directly. The models demonstrate creativity in

4m read time. From news.mit.edu