Don't you worry, sire! I have arrived to make thee understand how to remove shuffling while implementing joins in Spark 👨‍🦳. Nope, I am not talking about broadcast join. (When did I write a simple…

Towards Dev

The post explains how to optimize joins in Apache Spark by using a Sort-Merge-Bucket join (SMB join) instead of the traditional Sort-Merge join (SM join). In an SMB join, both tables are bucketed on the join key and sorted within each bucket before the join runs, so the join can merge matching buckets directly and skip the computationally expensive shuffle step. Avoiding the shuffle reduces resource consumption and improves the performance of Spark jobs.
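To make the idea concrete, here is a minimal plain-Python simulation of the bucket-then-sort-then-merge steps. It is a sketch of the concept, not Spark code and not necessarily the post's own example. The names `write_buckets`, `merge_join`, `smb_join`, and the sample `orders`/`customers` rows are all hypothetical, and it assumes both sides use the same bucket count. In Spark itself, the one-time write step is typically done with `DataFrameWriter.bucketBy(...)` and `sortBy(...)` followed by `saveAsTable(...)`.

```python
from collections import defaultdict

NUM_BUCKETS = 4  # assumption: both tables are written with the same bucket count


def write_buckets(rows, num_buckets=NUM_BUCKETS):
    """Simulate the one-time write: bucket rows by join key, then sort each bucket."""
    buckets = defaultdict(list)
    for key, value in rows:
        buckets[hash(key) % num_buckets].append((key, value))
    # Sort each bucket by key only; sorted() is stable, so input order breaks ties.
    return {b: sorted(items, key=lambda kv: kv[0]) for b, items in buckets.items()}


def merge_join(left, right):
    """Inner-join two key-sorted lists of (key, value) pairs by merging them."""
    out = []
    i = j = 0
    while i < len(left) and j < len(right):
        lk, rk = left[i][0], right[j][0]
        if lk < rk:
            i += 1
        elif lk > rk:
            j += 1
        else:
            # Locate the run of equal keys on the right side...
            j_end = j
            while j_end < len(right) and right[j_end][0] == lk:
                j_end += 1
            # ...and pair it with every left row carrying the same key.
            while i < len(left) and left[i][0] == lk:
                for _, right_value in right[j:j_end]:
                    out.append((lk, left[i][1], right_value))
                i += 1
            j = j_end
    return out


def smb_join(left_buckets, right_buckets):
    """Join bucket b of the left side only with bucket b of the right side."""
    result = []
    for b in sorted(left_buckets.keys() & right_buckets.keys()):
        result.extend(merge_join(left_buckets[b], right_buckets[b]))
    return result


# Hypothetical sample data: (join_key, payload)
orders = [(3, "order-a"), (1, "order-b"), (7, "order-c"), (3, "order-d")]
customers = [(1, "alice"), (3, "bob"), (5, "carol")]

joined = smb_join(write_buckets(orders), write_buckets(customers))
print(sorted(joined))
```

Because equal keys always hash to the same bucket number on both sides, matching rows can only meet inside the same bucket pair. No row has to move between buckets at join time, which is the step a shuffle would otherwise perform.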

Spark—Beyond Basics: SMB join in Apache Spark (No shuffle join)