You Don’t Need Many Labels to Learn
This title could be clearer and more informative.Try out Clickbait Shieldfor free (5 uses left this month).
A Gaussian Mixture Variational Autoencoder (GMVAE) can be trained entirely without labels and then converted into a classifier using as little as 0.2% labeled data — 35x less than XGBoost needs for comparable accuracy. The key insight is that unsupervised training already discovers the data's cluster structure; labels are only needed to name the clusters, not to build representations. Two decoding strategies are compared: hard decoding (assign each point to its most likely cluster) and soft decoding (use the full posterior distribution over clusters). Soft decoding provides an 18 percentage point accuracy gain when labeled data is extremely scarce, because it leverages model uncertainty and handles impure clusters. Experiments use the EMNIST Letters dataset with 145,600 images across 26 classes.
Table of contents
IntroductionTurning Clusters Into a ClassifierHow Much Supervision Do We Need in Practice?ConclusionSort: