Best of Data Engineering · August 2024

  1. Article · Medium

    How Did LinkedIn Handle 7 Trillion Messages Daily With Apache Kafka?

    LinkedIn uses Apache Kafka to process up to 7 trillion messages daily. It achieves reliability and scalability through a multi-tiered Kafka deployment across multiple data centers, combining local and aggregate clusters. An internal auditing tool verifies message completeness by comparing counts of sent and consumed messages. LinkedIn also maintains a close relationship with the open-source Kafka community, regularly contributing features and patches from its internal branch back upstream.
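    The details of LinkedIn's auditing tool are not in the summary, but the count-audit idea it describes can be sketched in a few lines: each tier reports how many messages it saw per (topic, time bucket), and an auditor flags buckets where a downstream tier saw fewer than its upstream. The class and field names below are illustrative, not LinkedIn's.

    ```python
    from collections import defaultdict

    class AuditCounter:
        """Toy completeness auditor: compares message counts reported by
        each Kafka tier per (topic, time-bucket). Loosely inspired by the
        audit idea in the article; names are illustrative."""

        def __init__(self):
            # counts[tier][(topic, bucket)] -> messages seen by that tier
            self.counts = defaultdict(lambda: defaultdict(int))

        def record(self, tier, topic, bucket, n=1):
            self.counts[tier][(topic, bucket)] += n

        def missing(self, src_tier, dst_tier):
            """(topic, bucket) keys where the destination tier saw fewer
            messages than the source tier, i.e. potential loss."""
            src, dst = self.counts[src_tier], self.counts[dst_tier]
            return {k: src[k] - dst.get(k, 0)
                    for k in src if dst.get(k, 0) < src[k]}

    audit = AuditCounter()
    audit.record("producer", "page-views", "2024-08-01T10:00", 1000)
    audit.record("local-cluster", "page-views", "2024-08-01T10:00", 1000)
    audit.record("aggregate-cluster", "page-views", "2024-08-01T10:00", 997)

    print(audit.missing("local-cluster", "aggregate-cluster"))
    # {('page-views', '2024-08-01T10:00'): 3}
    ```

    In a real deployment the per-tier counts would themselves travel through Kafka on a dedicated audit topic; the comparison logic stays this simple.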

  2. Article · KDnuggets

    Project Ideas to Master Data Engineering

    To effectively learn data engineering, working on projects is essential. Key skills to focus on include data transformation, data visualization, building data pipelines, and implementing data storage solutions like data lakes and data warehouses. The post suggests six project ideas to cover these aspects: building an end-to-end data pipeline, transforming data sets, implementing a data lake, creating a data warehouse, processing real-time data, and visualizing data with dashboards.

  3. Article · ByteByteGo

    Trillions of Indexes: How Uber’s LedgerStore Supports Such Massive Scale

    Uber's LedgerStore is a custom-built solution to manage trillions of financial transaction records efficiently. It ensures data immutability and supports various types of indexes including strongly consistent, eventually consistent, and time-range indexes. The migration from DynamoDB to LedgerStore for Uber's payment data was driven by the need for cost savings, simplified architecture, improved performance, and tailored features for financial data management. This transition involved handling 1.2 PB of compressed data with zero data inconsistencies detected over six months.
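    Of the index types mentioned, the time-range index is the easiest to illustrate: keep entries sorted by commit timestamp and answer range queries with binary search. This is a minimal sketch of the concept, not Uber's implementation.

    ```python
    import bisect

    class TimeRangeIndex:
        """Minimal sketch of a time-range index over immutable ledger
        records, in the spirit of LedgerStore's time-range indexes.
        Names and structure are illustrative."""

        def __init__(self):
            self._keys = []  # commit timestamps, kept sorted
            self._ids = []   # record ids, parallel to _keys

        def insert(self, ts, record_id):
            # Ledger records are immutable: entries are only ever added.
            i = bisect.bisect_right(self._keys, ts)
            self._keys.insert(i, ts)
            self._ids.insert(i, record_id)

        def query(self, start, end):
            """All record ids with start <= ts < end."""
            lo = bisect.bisect_left(self._keys, start)
            hi = bisect.bisect_left(self._keys, end)
            return self._ids[lo:hi]

    idx = TimeRangeIndex()
    for ts, rid in [(100, "txn-a"), (105, "txn-b"), (250, "txn-c")]:
        idx.insert(ts, rid)

    print(idx.query(100, 200))  # ['txn-a', 'txn-b']
    ```

    At trillions of entries the same idea is implemented with partitioned, on-disk sorted structures rather than in-memory lists, but the query shape is the same.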

  4. Article · Medium

    Embracing Simplicity and Composability in Data Engineering

    The post highlights the importance of simplicity and composability in data engineering, drawing lessons from decades of industry experience. It discusses the Unix philosophy of treating data as files, the evolution of databases and NoSQL, and the complexity introduced by new ecosystems like Hadoop and Kubernetes. The post also critiques the over-complication of agile methodologies and stresses the necessity of adhering to fundamental principles to maintain flexibility and long-term value in software systems.

  5. Article · Towards Data Science

    What It Takes to Build a Great Graph

    Graphs represent relationships and connections in data, making them powerful tools for analysis. A great graph has a clear purpose, is domain-specific, and has a well-defined schema. Successful implementation requires mechanisms for connecting datasets, scalability, and handling temporality. Designing a robust graph-based solution involves clear engineering practices and experienced graph data engineers.
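    The ingredients listed above (a well-defined schema plus temporality) can be made concrete with a toy typed, temporal edge store. The schema and node/edge types here are hypothetical examples, not taken from the article.

    ```python
    from dataclasses import dataclass

    # Hypothetical schema: which relationships are allowed between node types.
    SCHEMA = {
        ("person", "works_at", "company"),
        ("company", "located_in", "city"),
    }

    @dataclass(frozen=True)
    class Edge:
        src: str
        src_type: str
        rel: str
        dst: str
        dst_type: str
        valid_from: int  # temporality: when this relationship became true

    class Graph:
        def __init__(self):
            self.edges = []

        def add(self, edge):
            # Enforce the schema at write time, as the article recommends.
            if (edge.src_type, edge.rel, edge.dst_type) not in SCHEMA:
                raise ValueError(f"edge violates schema: {edge.rel}")
            self.edges.append(edge)

        def neighbors(self, node, rel, as_of):
            """Destinations of `rel` edges from `node` valid at time `as_of`."""
            return [e.dst for e in self.edges
                    if e.src == node and e.rel == rel and e.valid_from <= as_of]

    g = Graph()
    g.add(Edge("alice", "person", "works_at", "acme", "company", 2020))
    print(g.neighbors("alice", "works_at", as_of=2024))  # ['acme']
    ```

    Rejecting schema-violating edges at write time is what keeps a domain-specific graph queryable as it grows; temporal filters like `as_of` make point-in-time analysis possible.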

  6. Article · Towards Dev

    Spark — Beyond Basics: Hidden actions in your spark code

    The post discusses operations in Apache Spark that look like lazy transformations but actually trigger jobs. Using code snippets such as `read.csv()`, `df.groupby().pivot()`, and `foreach()`, it explains how and why these operations launch work on the cluster. Key insights include how the `inferSchema` option turns `read.csv()` from a lazy read into a job-triggering scan of the data, and why `pivot()` (when not given an explicit list of values) and `foreach()` run jobs of their own.
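    A self-contained PySpark sketch of the three cases the post covers is below (requires a local Spark installation; job launches can be confirmed in the Spark UI). The sample CSV and column names are illustrative.

    ```python
    import os
    import tempfile

    from pyspark.sql import SparkSession

    spark = (SparkSession.builder
             .master("local[1]")
             .appName("hidden-actions")
             .getOrCreate())

    # Tiny sample file so the snippet is runnable end to end.
    path = os.path.join(tempfile.mkdtemp(), "events.csv")
    with open(path, "w") as f:
        f.write("user_id,event_type\n1,click\n1,view\n2,click\n")

    # 1. With inferSchema=True, read.csv() is no longer purely lazy:
    #    Spark must scan the file to infer column types, so a job runs here.
    df = spark.read.csv(path, header=True, inferSchema=True)

    # 2. groupBy().pivot() without an explicit value list triggers a job
    #    to collect the distinct pivot values before the plan is built.
    pivoted = df.groupBy("user_id").pivot("event_type").count()

    #    Passing the values up front avoids that extra job:
    pivoted_fast = df.groupBy("user_id").pivot("event_type", ["click", "view"]).count()

    # 3. foreach() is a true action: it eagerly executes the whole plan.
    df.foreach(lambda row: None)

    print(sorted(r["user_id"] for r in pivoted.collect()))  # [1, 2]
    spark.stop()
    ```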

  7. Article · ITNEXT

    Bridging Backend and Data Engineering: Communicating Through Events

    Integrating backend services with data engineering pipelines is challenging with traditional methods like REST APIs and batch processing. Event-driven architecture (EDA) offers a robust alternative through asynchronous event communication. For smaller teams, a hybrid approach using a managed Pub/Sub system (e.g., GCP Pub/Sub, Amazon SQS) can be effective: define a standardized event format, set up a single topic to which all services broadcast their events, and let each subscription determine which events it processes. This enables flexible, real-time communication without overhauling existing infrastructure.
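    The single-topic pattern with subscription-side filtering can be sketched with an in-memory stand-in for the Pub/Sub system. The event envelope fields here are illustrative, not a published standard.

    ```python
    import json

    def make_event(event_type, source, payload):
        """Standardized envelope every service uses when publishing."""
        return json.dumps({"type": event_type, "source": source, "payload": payload})

    class Topic:
        """Toy single topic: all services publish here; each subscription
        filters for the events it cares about (a stand-in for GCP Pub/Sub
        or Amazon SQS plus filtering logic)."""

        def __init__(self):
            self._subs = []  # (predicate, handler) pairs

        def subscribe(self, predicate, handler):
            self._subs.append((predicate, handler))

        def publish(self, raw_event):
            event = json.loads(raw_event)
            for predicate, handler in self._subs:
                if predicate(event):  # subscription decides what to process
                    handler(event)

    topic = Topic()
    seen = []

    # The data pipeline subscribes only to order events, whatever the source.
    topic.subscribe(lambda e: e["type"] == "order.created",
                    lambda e: seen.append(e["payload"]["order_id"]))

    topic.publish(make_event("order.created", "checkout-service", {"order_id": 42}))
    topic.publish(make_event("user.signed_up", "auth-service", {"user_id": 7}))

    print(seen)  # [42]
    ```

    Because filtering lives in the subscription rather than the publisher, new consumers can be added without any change to the services emitting events.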

  8. Article · Last9

    Control Plane: A centralized place to manage your data and its settings

    Control Plane centralizes the management of data policies and settings for ingestion, storage, and queries. It also aids in debugging with tools like Cardinality Explorer and Slow Query Logs. Recent updates include support for Logs, Traces, Metrics, and Events pipelines, as well as segregation of settings by organization and user. Notification Channels are now managed in Alert Studio alongside other alert configurations.