Handling large images is a challenge because down-sampling or cropping discards information. The xT framework introduces a nested tokenization approach that divides images into regions and sub-regions for more tractable processing. It combines region encoders and context encoders to extract features at different scales while still integrating global context across the whole image.
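The nested tokenization step can be illustrated with a minimal sketch: the image is first split into large regions, and each region is further split into smaller patches. The function name, sizes, and layout here are illustrative assumptions, not code from the xT repository:

```python
import numpy as np

def nested_tokenize(image, region_size, patch_size):
    """Split an image into regions, then split each region into patches.

    Assumes region_size divides both image dimensions and
    patch_size divides region_size (illustrative simplification).
    """
    H, W, _ = image.shape
    regions = []
    for i in range(0, H, region_size):
        for j in range(0, W, region_size):
            region = image[i:i + region_size, j:j + region_size]
            # Second level of the hierarchy: sub-tokenize the region
            patches = [
                region[r:r + patch_size, c:c + patch_size]
                for r in range(0, region_size, patch_size)
                for c in range(0, region_size, patch_size)
            ]
            regions.append(patches)
    return regions

# A 64x64 RGB image -> 4 regions of 32x32, each holding 16 patches of 8x8
img = np.zeros((64, 64, 3))
tokens = nested_tokenize(img, region_size=32, patch_size=8)
```

In the actual framework, the inner patches are fed to a region encoder and the region-level summaries to a context encoder; this sketch only shows the hierarchical splitting.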

6 min read · From bair.berkeley.edu
Table of contents

- Why Bother with Big Images Anyway?
- How xT Tries to Fix This
- Nested Tokenization
- Coordinating Region and Context Encoders
- Results
- Why This Matters More Than You Think
- In Conclusion
