Researchers propose Vision Mamba (Vim), a new generic vision backbone built from bidirectional Mamba blocks. Vim combines position embeddings, which make visual recognition location-aware, with bidirectional state space models (SSMs), which provide data-dependent global visual context modeling. It matches the modeling power of ViT without using self-attention and outperforms DeiT.
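The idea of scanning a token sequence in both directions can be illustrated with a toy sketch. This is not the Vim implementation: the function names `linear_scan` and `bidirectional_scan`, the fixed scalar `decay`, and the plain sum of the two directions are all assumptions made for illustration. In particular, the recurrence here uses a constant decay, whereas Mamba's SSM parameters depend on the input.

```python
import numpy as np

def linear_scan(x, decay):
    """Causal recurrence h_t = decay * h_{t-1} + x_t along the token axis.

    x has shape (num_tokens, dim). This is a simplified stand-in for an SSM
    scan (hypothetical, fixed decay; not Mamba's selective scan).
    """
    h = np.zeros(x.shape[1])
    out = np.empty(x.shape, dtype=float)
    for t in range(x.shape[0]):
        h = decay * h + x[t]
        out[t] = h
    return out

def bidirectional_scan(tokens, pos_embed, decay=0.5):
    """Add position embeddings, then combine a forward and a backward scan."""
    x = tokens + pos_embed                        # make each token location-aware
    forward = linear_scan(x, decay)               # token t sees tokens 0..t
    backward = linear_scan(x[::-1], decay)[::-1]  # token t sees tokens t..end
    return forward + backward                     # every token sees the whole sequence

# Usage: 4 patch tokens with 8 features each (random placeholder data).
rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
pos_embed = rng.normal(size=(4, 8))
context = bidirectional_scan(tokens, pos_embed)
print(context.shape)
```

A single causal scan would let the first patch see only itself. Adding the reversed scan gives every patch a summary of the whole image sequence, which is the role attention plays in ViT.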