An exploration of the curse of dimensionality and its impact on nearest neighbor search in high-dimensional spaces. As the dimension increases, the relative difference between the distances to the nearest and farthest points approaches zero (equivalently, their ratio approaches one), making distance comparisons less and less meaningful. However, real-world datasets such as word embeddings (1000-dimensional Freebase vectors) and image datasets (784-dimensional MNIST) exhibit much lower intrinsic dimensionality due to underlying structure; the Freebase vectors, for instance, behave roughly like a 16-dimensional normal distribution. This explains why approximate nearest neighbor methods work well in practice and why dimensionality reduction is effective. The author also expresses skepticism toward locality-sensitive hashing (LSH), favoring algorithms that learn the data distribution directly.

4 min read, from erikbern.com
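To make the concentration effect concrete, here is a minimal sketch (not from the post; the sample sizes, dimensions, and Gaussian data are arbitrary choices for illustration). It measures the relative contrast between the nearest and farthest distances from a random query as the ambient dimension grows, then repeats the measurement with 16-dimensional data linearly embedded in a 1000-dimensional space to mimic low intrinsic dimensionality:

```python
import numpy as np

rng = np.random.default_rng(0)

def relative_contrast(points, query):
    """(d_max - d_min) / d_min over distances from query to points."""
    dists = np.linalg.norm(points - query, axis=1)
    return (dists.max() - dists.min()) / dists.min()

# Distances concentrate as the ambient dimension grows: the relative gap
# between the nearest and farthest point shrinks toward zero.
for d in (2, 16, 128, 1024):
    points = rng.standard_normal((10_000, d))
    query = rng.standard_normal(d)
    print(f"d={d:4d}  contrast={relative_contrast(points, query):.3f}")

# Low intrinsic dimensionality changes the picture: 16D Gaussian data
# mapped linearly into 1000 ambient dimensions keeps roughly the
# contrast of the 16D case, not the 1024D one.
embed = rng.standard_normal((16, 1000))
points = rng.standard_normal((10_000, 16)) @ embed
query = rng.standard_normal(16) @ embed
print(f"16D-in-1000D  contrast={relative_contrast(points, query):.3f}")
```

The second experiment is why intrinsic dimensionality matters for approximate nearest neighbor search: the random linear map nearly preserves distances within the 16-dimensional subspace, so the data remains searchable despite the 1000-dimensional ambient representation.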