LoGeR is a feedforward 3D reconstruction system designed for extremely long video sequences (up to 19,000 frames). It targets two obstacles: the 'context wall', where the cost of full-attention models grows quadratically with sequence length, and the 'data wall' that results from training only on short sequences. LoGeR processes the video in chunks and uses a hybrid memory module that combines Sliding Window Attention (SWA), for precise local geometric alignment, with Test-Time Training (TTT), for long-range global consistency. On the 19k-frame VBR dataset, LoGeR achieves a 30.8% relative improvement over prior feedforward methods. On KITTI, it achieves the best average ATE (18.65). It also remains competitive on short-sequence benchmarks such as 7-Scenes and ScanNet.
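
The summary does not say how the two memory branches are wired together, so the following is only a toy NumPy sketch of the general idea, not LoGeR's architecture or code. It assumes one token per frame, random fixed projections in place of learned weights, a sliding window of exactly one previous chunk, and a TTT memory implemented as a linear fast-weight matrix updated by one gradient step per chunk on a key-to-value regression loss. All names (`process_sequence`, `ttt_update`, `attention_pairs`, `chunk_size`, `lr`) are hypothetical. The `attention_pairs` helper counts query-key scores to show why chunking sidesteps the quadratic 'context wall'.

```python
import numpy as np

# Toy illustration only (hypothetical names and design, not LoGeR's code):
# one token per frame, random projections, window = one previous chunk,
# TTT memory = linear fast weights trained by one gradient step per chunk.

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def swa_read(q, k_win, v_win):
    # Local branch: softmax attention restricted to a window of recent tokens.
    scores = q @ k_win.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v_win

def ttt_update(W, k, v, lr):
    # Global branch: one gradient step on 0.5 * mean_i ||k_i @ W - v_i||^2.
    grad = k.T @ (k @ W - v) / k.shape[0]
    return W - lr * grad

def process_sequence(tokens, chunk_size, lr=0.1, seed=0):
    """Chunked pass over a (num_tokens, dim) array with a hybrid SWA + TTT memory."""
    rng = np.random.default_rng(seed)
    dim = tokens.shape[1]
    # Hypothetical fixed projections standing in for learned query/key/value weights.
    Wq, Wk, Wv = (rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(3))
    W_mem = np.zeros((dim, dim))  # constant-size memory, independent of sequence length
    prev_k = prev_v = None
    outputs = []
    for start in range(0, len(tokens), chunk_size):
        x = tokens[start:start + chunk_size]
        q, k, v = x @ Wq, x @ Wk, x @ Wv
        # SWA: attend to this chunk plus the previous chunk only.
        k_win = k if prev_k is None else np.concatenate([prev_k, k])
        v_win = v if prev_v is None else np.concatenate([prev_v, v])
        local = swa_read(q, k_win, v_win)
        # TTT: read from the memory built from all earlier chunks, then update it.
        global_ = q @ W_mem
        W_mem = ttt_update(W_mem, k, v, lr)
        outputs.append(x + local + global_)
        prev_k, prev_v = k, v
    return np.concatenate(outputs), W_mem

def attention_pairs(num_tokens, chunk_size=None):
    # Count query-key scores: full attention if chunk_size is None, else chunked SWA.
    if chunk_size is None:
        return num_tokens * num_tokens
    total, prev = 0, 0
    for start in range(0, num_tokens, chunk_size):
        c = min(chunk_size, num_tokens - start)
        total += c * (c + prev)  # each query sees its own chunk plus the previous one
        prev = c
    return total

tokens = np.random.default_rng(0).standard_normal((1000, 16))  # 1000 toy "frame" tokens
out, memory = process_sequence(tokens, chunk_size=100)
print(out.shape)  # (1000, 16)
print(attention_pairs(19_000), attention_pairs(19_000, chunk_size=100))  # 361000000 3790000
```

In this toy setup the local branch sees only a bounded window, so its cost grows linearly with the number of frames, while the TTT branch compresses everything seen so far into a fixed-size matrix. That split mirrors the division of labor the summary describes: exact attention where precise alignment matters, and a compact learned memory for global consistency.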

From loger-project.github.io