Language models trained purely on text have a solid understanding of the visual world. They can generate complex scenes and refine their images. Researchers from MIT have tested the visual knowledge of these models and trained a computer vision system without using any visual data directly. The models demonstrate creativity in
Sort: