The post discusses the limitations of CNNs in capturing long-range dependencies and global context in computer vision tasks. It introduces transformers as an alternative architecture that excels at modeling global relationships. To combine the strengths of CNNs and transformers, the post presents Convolutional Self-Attention (CSA), which captures both local and global feature relations using only convolution operations. CSA outperforms contemporary transformer models when running on TensorRT, achieving lower latency at comparable accuracy, and it is fully compatible with TensorRT restricted mode.
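The post does not include the CSA implementation itself, but the general idea of expressing self-attention through convolutions can be illustrated with a minimal sketch: 1×1 convolutions (per-pixel linear projections) produce the query, key, and value maps, and a global softmax attention is then applied over all spatial positions. The function names and shapes below are illustrative assumptions, not the actual CSA design.

```python
import numpy as np

def conv1x1(x, w):
    # A 1x1 convolution is a per-pixel linear projection.
    # x: (C_in, H, W), w: (C_out, C_in) -> (C_out, H, W)
    c, h, wd = x.shape
    return (w @ x.reshape(c, h * wd)).reshape(-1, h, wd)

def conv_self_attention(x, wq, wk, wv):
    # Illustrative sketch (not the CSA from the post): project to
    # queries/keys/values with 1x1 convs, then apply global softmax
    # attention across every spatial position.
    d = wq.shape[0]
    q = conv1x1(x, wq).reshape(d, -1)   # (d, H*W)
    k = conv1x1(x, wk).reshape(d, -1)
    v = conv1x1(x, wv).reshape(d, -1)
    scores = (q.T @ k) / np.sqrt(d)     # (H*W, H*W) pairwise affinities
    scores -= scores.max(axis=-1, keepdims=True)  # numerical stability
    attn = np.exp(scores)
    attn /= attn.sum(axis=-1, keepdims=True)      # rows sum to 1
    out = v @ attn.T                    # attention-weighted sum of values
    return out.reshape(d, *x.shape[1:])

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 4, 4))      # toy feature map: 8 channels, 4x4
wq, wk, wv = (rng.standard_normal((8, 8)) for _ in range(3))
y = conv_self_attention(x, wq, wk, wv)
print(y.shape)                          # spatial shape is preserved
```

Because every projection here is a convolution and the attention itself is plain matrix arithmetic, a layer built this way maps onto standard convolution and GEMM kernels, which is the kind of property that makes an attention variant deployable on inference runtimes such as TensorRT.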

9-minute read. From developer.nvidia.com.
Table of contents
- Fusing convolutions and self-attention
- Convolutional Self-Attention
- Performance in accuracy and latency
- Conclusion
