Best of Data Processing — October 2024

1
Article
Data Engineer Things·2y
I spent 8 hours learning the details of the Apache Spark scheduling process.
The post examines how Apache Spark schedules work. It covers the anatomy of a Spark job, stages, tasks, and the Directed Acyclic Graph (DAG) scheduler. It explains how SparkContext initiates scheduling, the roles of TaskScheduler and SchedulerBackend, and the concept of data locality in task execution. The post also discusses speculative execution for handling slow tasks and walks through the scheduling process end to end.
87
2
Article
System Design Codex·2y
Kafka Load Balancing at Agoda for Terabytes of Data
Agoda uses Kafka to manage hundreds of terabytes of data across various supply systems, including hotels and restaurants, ensuring real-time price updates. They faced challenges with the traditional round-robin partitioning and consumer assignment due to heterogeneous hardware and uneven workloads, resulting in over-provisioning. Agoda addressed these issues by implementing dynamic, lag-aware strategies, including a lag-aware producer and consumer, to optimize message distribution and reduce latency.
56
1
3
Article
Baeldung·2y
Logstash vs. Kafka
Logstash and Kafka are powerful tools for managing real-time data streams, with Logstash specializing in data processing and Kafka excelling in distributed event streaming. Logstash is ideal for transforming log data and forwarding it to various outputs, while Kafka is designed for high-throughput, fault-tolerant message delivery. This post compares their components in depth, provides command-line examples, and discusses how the two can work together to build robust data pipelines.
16
4
Article
Baeldung·2y
Introduction to Apache Hadoop
The post introduces Apache Hadoop, a powerful open-source framework designed for distributed storage and processing of large datasets. It explains Hadoop's core components, including HDFS for storage, YARN for resource management, and MapReduce for data processing. The tutorial guides the reader through setting up a Hadoop cluster on a GNU/Linux platform and performing basic operations such as file management and running MapReduce jobs. It also highlights several tools within the Hadoop ecosystem that support data ingestion, analysis, and extraction.
14
5
Article
Baeldung·2y
Determining Empty Row in an Excel File With Java
Identifying empty rows in Excel files is crucial for accurate data analysis. This guide discusses how to detect empty rows using three popular Java libraries: Apache POI, JExcel, and fastexcel. Each section covers adding Maven dependencies, creating helper methods to determine empty rows, and testing these methods to ensure their functionality.
10
