Part 2 of a series on summarizing large documents using clustering and LLMs. Starting from 1,360 chunks of the GitLab Employee Handbook (220,035 tokens), the workflow evaluates K-means cluster quality using silhouette, Calinski-Harabasz, and Davies-Bouldin scores, then visualizes the clusters with UMAP dimensionality reduction. Representative chunks are selected per cluster by Euclidean distance to the cluster centroid, summarized individually via LangChain map prompts, and then combined into a final summary, reducing the context from 220,035 tokens to just 4,219 (a 98% reduction). The post assesses both strengths (scalability, thematic coverage) and weaknesses (the final aggregation narrows topic breadth), and suggests improvements such as stronger reduce prompts and multiple representatives per cluster.
Table of contents
A quick technical refresher
Class is resumed…welcome back from the holidays!
Getting up close and personal
Yeah, I’m not convinced yet with our Clusters
It’s time to represent!
Can we start summarizing the document already?
Was it all worth it?
Singularity
Where did we go Right?
Where did we go Wrong?
Class Dismissed, This Time for Real
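
As a quick preview of the representative-selection step mentioned in the summary, here is a minimal sketch of picking, for each cluster, the chunk whose embedding lies closest (by Euclidean distance) to the cluster centroid. It is only an illustration under stated assumptions: the function name `pick_representatives`, the toy 2-D vectors, and the hard-coded labels are all hypothetical, and the centroid is recomputed as the mean of each cluster's members rather than read from a fitted K-means model as the post's workflow would do.

```python
from collections import defaultdict
from math import dist  # Euclidean distance between two points


def pick_representatives(embeddings, labels):
    """Return {cluster_label: index of the chunk nearest that cluster's centroid}."""
    # Group chunk indices by their assigned cluster label.
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)

    representatives = {}
    for label, idxs in members.items():
        dims = len(embeddings[idxs[0]])
        # Centroid = per-dimension mean of the cluster's vectors (assumption:
        # a fitted K-means model would supply this directly).
        centroid = [sum(embeddings[i][d] for i in idxs) / len(idxs) for d in range(dims)]
        # Representative = member with the smallest Euclidean distance to the centroid.
        representatives[label] = min(idxs, key=lambda i: dist(embeddings[i], centroid))
    return representatives


# Toy 2-D "embeddings" (hypothetical; real chunk embeddings have far more dimensions).
embeddings = [(0.0, 0.0), (1.0, 0.0), (4.0, 0.0), (10.0, 10.0), (11.0, 10.0), (10.5, 10.2)]
labels = [0, 0, 0, 1, 1, 1]  # stand-in for K-means cluster assignments

representatives = pick_representatives(embeddings, labels)
print(representatives)
```

Only the chunks at the returned indices would be passed on to the per-chunk summarization prompts, which is where most of the token savings come from.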