Current AI vision models operate in 2D flatland and lack native 3D spatial understanding. Three converging AI layers can bridge this gap: metric depth estimation (e.g., Depth-Anything-3), foundation segmentation (e.g., SAM), and geometric fusion. The geometric fusion layer, the least discussed of the three, uses camera intrinsics and extrinsics to back-project 2D predictions into 3D space, then applies a KD-tree ball-query voting algorithm to propagate sparse labels to unlabeled points. The fusion pipeline has four stages (noise gate, spatial index, target identification, democratic vote), runs in under 10 seconds on 800K points on a consumer CPU, and expands label coverage from ~20% to ~78%, a 3.5x amplification factor. Real-world results include reducing a 2-day annotation task to 11 minutes on a 12-million-point construction site scan. The remaining open problem is multi-view consistency: per-image predictions from different viewpoints sometimes disagree at class boundaries.
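To make the four stages concrete, here is a minimal Python sketch of KD-tree ball-query voting. It is an illustration under stated assumptions, not the article's actual implementation. The function name `propagate_labels`, the parameters `radius`, `min_support`, and `min_votes`, and the `-1` marker for unlabeled points are all hypothetical. The summary does not define the noise gate, so the sketch assumes it means dropping seed labels that too few nearby seeds agree with.

```python
import numpy as np
from scipy.spatial import cKDTree

UNLABELED = -1  # assumed marker for points without a label


def propagate_labels(points, labels, radius=0.5, min_support=2, min_votes=3):
    """Spread sparse labels to unlabeled 3D points by neighbourhood voting.

    points: (N, 3) float array; labels: (N,) int array, UNLABELED where unknown.
    Returns a new label array; the input is left untouched.
    """
    points = np.asarray(points, dtype=float)
    orig = np.asarray(labels)
    out = orig.copy()

    # Stage 1, noise gate (assumed meaning): keep a seed label only if at
    # least `min_support` other seeds within `radius` carry the same label.
    seeds = np.flatnonzero(orig != UNLABELED)
    if seeds.size == 0:
        return out
    gate_tree = cKDTree(points[seeds])
    neighbourhoods = gate_tree.query_ball_point(points[seeds], r=radius)
    keep = np.array([
        np.count_nonzero(orig[seeds[nbrs]] == orig[i]) - 1 >= min_support
        for i, nbrs in zip(seeds, neighbourhoods)
    ])
    out[seeds[~keep]] = UNLABELED
    seeds = seeds[keep]
    if seeds.size == 0:
        return out

    # Stage 2, spatial index: KD-tree over the surviving seed points only.
    tree = cKDTree(points[seeds])
    seed_labels = orig[seeds]

    # Stage 3, target identification: every point that has no label now.
    targets = np.flatnonzero(out == UNLABELED)
    if targets.size == 0:
        return out

    # Stage 4, democratic vote: each target takes the majority label of the
    # seeds inside its ball, provided the winner has at least `min_votes`.
    for t, nbrs in zip(targets, tree.query_ball_point(points[targets], r=radius)):
        if not nbrs:
            continue
        votes = np.bincount(seed_labels[nbrs])
        winner = votes.argmax()
        if votes[winner] >= min_votes:
            out[t] = winner
    return out
```

Calling `propagate_labels(points, labels, radius=...)` on an `(N, 3)` array returns a denser label array. Targets with too few nearby seeds stay unlabeled, so a vote never invents a label in regions with no evidence. The radius should be chosen relative to the scan's point spacing.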
Table of contents
The 3D annotation bottleneck that nobody talks about
Three layers of spatial AI are converging right now into a single 3D labeling stack
How geometric reasoning turns 2D pixels into labeled 3D places
The four-stage fusion pipeline for 3D label propagation
From 20% to 78% label coverage: what 3D geometric fusion actually produces
The open problem in spatial AI: multi-view consistency and where 3D labeling is heading
What I expect to unfold in the next 12 to 18 months
Resources for going deeper into spatial AI and 3D data science
Frequently asked questions about spatial AI and 3D semantic understanding