This post explores the application of transformers in image processing within the field of computer vision, detailing three main methods: Pixel Transformers, Vision Transformers (ViT) by Google Brain, and Swin Transformers by Microsoft. It highlights the limitations of CNNs and offers solutions to computational inefficiencies, such as using image patches and techniques like window attention and hierarchical patches.
Sort: