An exploration of the curse of dimensionality and its impact on nearest neighbor search in high-dimensional spaces. As the dimension increases, the relative difference between the distances to the nearest and farthest points approaches zero (equivalently, their ratio approaches one), making distance comparisons less and less meaningful. However, real-world datasets such as word embeddings (1000-dimensional Freebase vectors) and image datasets (784-dimensional MNIST) exhibit much lower intrinsic dimensionality due to underlying structure; the Freebase vectors, for instance, behave roughly like a 16-dimensional normal distribution. This explains why approximate nearest neighbor methods work well in practice and why dimensionality reduction is effective. The author also expresses skepticism toward locality-sensitive hashing (LSH), favoring algorithms that learn the data distribution directly.

4 min read, from erikbern.com
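To make the concentration effect concrete, here is a minimal sketch (not from the post; the sample sizes, dimensions, and Gaussian data are arbitrary choices for illustration). It measures the relative contrast between the nearest and farthest distances from a random query as the ambient dimension grows, then repeats the measurement with 16-dimensional data linearly embedded in a 1000-dimensional space to mimic low intrinsic dimensionality:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(points, query):
    """(d_max - d_min) / d_min over distances from query to points."""
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Distances concentrate as the ambient dimension grows: the relative gap
# between the nearest and farthest point shrinks toward zero.
for d in (2, 16, 128, 1024):
    points = rng.standard_normal((10_000, d))
    query = rng.standard_normal(d)
    print(f"d={d:4d}  contrast={relative_contrast(points, query):.3f}")

# Low intrinsic dimensionality changes the picture: 16D Gaussian data
# mapped linearly into 1000 ambient dimensions keeps roughly the
# contrast of the 16D case, not the 1024D one.
embed = rng.standard_normal((16, 1000))
points = rng.standard_normal((10_000, 16)) @ embed
query = rng.standard_normal(16) @ embed
print(f"16D-in-1000D  contrast={relative_contrast(points, query):.3f}")
```

The second experiment is why intrinsic dimensionality matters for approximate nearest neighbor search: the random linear map nearly preserves distances within the 16-dimensional subspace, so the data remains searchable despite the 1000-dimensional ambient representation.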