How to process a 100 KB file using Spark — Wait, What!?


Demonstrates how to configure Apache Spark for processing very small datasets (around 100 KB) by tuning settings such as running in local mode, reducing shuffle partitions from the default of 200 down to 1-10, disabling the Spark UI, and limiting executor resources. The author implemented these optimizations in Spark Playground, an online PySpark compiler.
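The tweaks the summary lists can be sketched as a set of Spark configuration properties. This is a minimal illustration, not the author's actual setup: the property keys are real Spark settings, but the specific values (one shuffle partition, 512m memory) are assumptions chosen for a ~100 KB workload, and the `build_spark_submit_args` helper is hypothetical.

```python
# Hypothetical sketch of small-file Spark tuning, per the article's summary.
# The config keys are real Spark settings; the values are illustrative.

small_file_conf = {
    # Run everything in one local JVM thread instead of a cluster.
    "spark.master": "local[1]",
    # The default of 200 shuffle partitions is overkill for ~100 KB;
    # a single partition avoids scheduling hundreds of near-empty tasks.
    "spark.sql.shuffle.partitions": "1",
    # Skip starting the web UI to save startup time and memory.
    "spark.ui.enabled": "false",
    # Keep driver and executor memory small for a tiny dataset.
    "spark.driver.memory": "512m",
    "spark.executor.memory": "512m",
}


def build_spark_submit_args(conf):
    """Render the settings as `--conf key=value` flags for spark-submit."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


if __name__ == "__main__":
    print(" ".join(build_spark_submit_args(small_file_conf)))
```

The same keys could equally be applied in code via `SparkSession.builder.config(key, value)` before calling `getOrCreate()`.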

3 min read · From blog.det.life
