Best of Big Data — 2024

1
Article
System Design Codex·2y
Introduction to Kafka
Kafka is a distributed event store and streaming platform originally developed at LinkedIn and now widely used by companies like Netflix and Uber for data pipelines. It is favored for its reliability and scalability. Messages are written in batches for efficiency and organized into topics, which are split into partitions. Producers send messages to Kafka brokers, and consumers read them. Brokers usually run as a cluster, which allows messages to be replicated for redundancy. Kafka's drawbacks include a large number of configuration options and less mature client libraries outside Java and C.
184
2
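To make the topic, partition, and batch ideas in the Kafka summary above concrete, here is a minimal sketch in plain Python rather than a real Kafka client. The partition count, the `partition_for` and `build_batches` helpers, and the use of CRC32 are all assumptions for illustration; Kafka's own partitioner uses a different hash.

```python
import zlib
from collections import defaultdict

NUM_PARTITIONS = 3  # hypothetical partition count for a single topic


def partition_for(key: str, num_partitions: int = NUM_PARTITIONS) -> int:
    """Map a message key to a partition with a stable hash.

    CRC32 is a stand-in chosen for this sketch; it is not Kafka's hash.
    """
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def build_batches(messages):
    """Group (key, value) pairs into one batch per partition, keeping order."""
    batches = defaultdict(list)
    for key, value in messages:
        batches[partition_for(key)].append((key, value))
    return dict(batches)


messages = [
    ("user-1", "login"),
    ("user-2", "login"),
    ("user-1", "click"),
    ("user-1", "logout"),
]
batches = build_batches(messages)

for partition, batch in sorted(batches.items()):
    print(partition, batch)
```

Because the hash is stable, every message with the same key lands in the same partition, so the relative order of one user's events is preserved within that partition.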
2
Article
Medium·2y
How Did LinkedIn Handle 7 Trillion Messages Daily With Apache Kafka?
LinkedIn uses Apache Kafka to manage and process up to 7 trillion messages daily. They achieve reliability and scalability through a multi-tiered Kafka deployment across multiple data centers, leveraging local and aggregate clusters. LinkedIn ensures message completeness with an internal auditing tool that tracks sent and consumed messages. They maintain a close relationship with the open-source Kafka community by regularly integrating features and patches from their internal branches into the upstream Kafka branch.
175
4
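The completeness audit mentioned in the LinkedIn summary above can be pictured as comparing produced and consumed message counts per time window. The sketch below is only an illustration of that idea, not LinkedIn's tool: the 10-minute bucket size, the `bucket` and `audit` helpers, and the sample timestamps are all hypothetical.

```python
from collections import Counter

BUCKET_SECONDS = 600  # hypothetical audit window of 10 minutes


def bucket(timestamp: int) -> int:
    """Round a timestamp down to the start of its audit window."""
    return timestamp - timestamp % BUCKET_SECONDS


def audit(produced, consumed):
    """Return {window_start: produced_minus_consumed} for windows that differ."""
    sent = Counter(bucket(ts) for ts in produced)
    seen = Counter(bucket(ts) for ts in consumed)
    windows = sorted(sent.keys() | seen.keys())
    return {w: sent[w] - seen[w] for w in windows if sent[w] != seen[w]}


# Timestamps (in seconds) of messages the producers and consumers reported.
produced = [0, 10, 599, 600, 601, 1300]
consumed = [0, 10, 599, 600, 1300]

report = audit(produced, consumed)
print(report)  # {600: 1}
```

A non-empty report points at the window in which a message was sent but never consumed.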
3
Article
Community Picks·2y
The State of Data Engineering 2024
The 2024 State of Data Engineering report discusses the influence of GenAI on software infrastructure, the expansion of product offerings due to the economic downturn, and the impact of open table formats and their catalogs in the data lake industry. It also highlights the importance of data version control and observability in AI/ML systems.
144
3
4
Article
Data Engineer Things·2y
I spent 3 hours learning how Uber manages data quality.
Uber leverages a comprehensive data quality platform that utilizes automatic detection and management to maintain high data standards across over 2,000 datasets. The platform includes components such as Test Execution Engine, Test Generator, and Alert Generator to ensure operational excellence. The platform automates various tasks, such as generating tests and alerts, and rerunning failed tests to verify incidents. Uber also integrates its data quality tools with other platforms to provide a seamless experience for its internal teams.
131
2
5
Video
Community Picks·2y
7 Must-know Strategies to Scale Your Database
Understanding when and why to scale your database is essential to maintain performance as your application grows. Key strategies include indexing for quick data retrieval, materialized views that store pre-computed snapshots of data, and denormalization to simplify complex queries. Vertical scaling (adding resources to a single server) and caching frequently accessed data in a fast storage layer both improve responsiveness. Replication improves availability and fault tolerance by keeping copies of the database on multiple servers. Sharding, which splits a database into smaller sections, enables horizontal scaling and spreads large data loads across machines.
89
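Of the strategies in the scaling summary above, indexing is the easiest to try locally. The following sketch uses Python's built-in `sqlite3` module with a hypothetical `users` table; it compares the query plan SQLite reports before and after an index is created. The exact plan wording varies between SQLite versions, so the code only prints it.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, email TEXT, country TEXT)")
conn.executemany(
    "INSERT INTO users (email, country) VALUES (?, ?)",
    [(f"user{i}@example.com", "DE" if i % 2 else "US") for i in range(1000)],
)

QUERY = "SELECT id FROM users WHERE email = ?"


def plan(sql: str, params: tuple) -> str:
    """Return the human-readable detail column of EXPLAIN QUERY PLAN."""
    rows = conn.execute("EXPLAIN QUERY PLAN " + sql, params).fetchall()
    return " | ".join(row[3] for row in rows)


# Without an index the lookup has to scan the whole table.
plan_before = plan(QUERY, ("user42@example.com",))

# With an index on email, SQLite can search the index instead.
conn.execute("CREATE INDEX idx_users_email ON users (email)")
plan_after = plan(QUERY, ("user42@example.com",))

print("before:", plan_before)
print("after: ", plan_after)
```

The trade-off is the usual one: the index speeds up reads on `email` but takes extra storage and must be updated on every write.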
6
Article
Data Engineer Things·2y
I spent 8 hours learning the details of the Apache Spark scheduling process.
The post delves into the details of the Apache Spark scheduling process. It covers the anatomy of a Spark job, stages, tasks, and the Directed Acyclic Graph (DAG) scheduler. It explains how SparkContext initiates scheduling, the roles of TaskScheduler and SchedulerBackend, and the concept of data locality in task execution. The post also discusses speculative execution to handle slow tasks and the entire end-to-end scheduling process in Spark.
87
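The Spark summary above describes a job as a DAG of stages. One property of that model, that a stage can only run after the stages it depends on have finished, can be sketched with Python's standard `graphlib` module. This is a toy illustration, not Spark's DAGScheduler; the stage names and the `schedule_waves` helper are hypothetical.

```python
from graphlib import TopologicalSorter

# Hypothetical stage graph: each stage maps to the parent stages it depends on.
stage_parents = {
    "stage0_read_orders": set(),
    "stage1_read_users": set(),
    "stage2_join": {"stage0_read_orders", "stage1_read_users"},
    "stage3_aggregate": {"stage2_join"},
}


def schedule_waves(parents):
    """Group stages into waves; every stage in a wave has all parents done."""
    sorter = TopologicalSorter(parents)
    sorter.prepare()
    waves = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # stages whose parents are complete
        waves.append(ready)
        sorter.done(*ready)  # mark the whole wave as finished
    return waves


waves = schedule_waves(stage_parents)
for number, wave in enumerate(waves):
    print(number, wave)
```

The two read stages have no dependencies and can run side by side, while the join and the aggregation each wait for the previous wave.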
7
Article
ByteByteGo·2y
EP135: Big Data Pipeline Cheatsheet for AWS, Azure, and Google Cloud
The post covers several topics aimed at engineering leaders, led by a cheatsheet of big data pipeline services on AWS, Azure, and Google Cloud that maps each platform's offerings for data ingestion, storage, processing, and visualization. It also discusses API architectural styles and offers a concise guide to building secure APIs. The issue closes with a resource on commonly used data structures, an advertisement for an enterprise conference, and a mini crash course on advanced AI tools.
87
2
8
Article
ByteByteGo·2y
How McDonald Sells Millions of Burgers Per Day With Event-Driven Architecture
McDonald's has developed a unified, event-driven platform to handle its global operations efficiently. The platform is designed for scalability, high availability, performance, security, reliability, consistency, and simplicity. Core components include Amazon Managed Streaming for Apache Kafka (MSK), a schema registry, a standby event store, custom SDKs, and an event gateway. Schema validation and robust error handling protect data integrity and keep processing efficient. Key techniques include data governance, cluster autoscaling, and domain-based sharding. Planned enhancements include formal event specification, a transition to serverless MSK, and improved developer tooling.
75
2
9
Article
Quastor Daily·2y
How Canva Collects 25 Billion Events Per Day
Canva processes over 25 billion events daily using AWS Kinesis, which gives them real-time data analysis at a lower cost. Their data pipeline batches, compresses, and enriches events before routing them to Snowflake for further analysis. Switching from AWS SQS to Kinesis reduced their costs by 85%.
73
1
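The batching and compression step mentioned in the Canva summary above can be sketched with the standard library alone. The summary does not name a codec or wire format, so `zlib`, JSON, the event fields, and the `encode_batch` and `decode_batch` helpers are assumptions made for this example.

```python
import json
import zlib


def encode_batch(events):
    """Serialize a list of events as compact JSON and compress the result."""
    raw = json.dumps(events, separators=(",", ":")).encode("utf-8")
    return zlib.compress(raw)


def decode_batch(payload):
    """Reverse encode_batch: decompress, then parse the JSON."""
    return json.loads(zlib.decompress(payload).decode("utf-8"))


# Hypothetical events; real ones would carry many more fields.
events = [
    {"event": "design_opened", "user_id": i % 50, "ts": 1_700_000_000 + i}
    for i in range(500)
]

payload = encode_batch(events)
raw_size = len(json.dumps(events, separators=(",", ":")).encode("utf-8"))

print("raw bytes:       ", raw_size)
print("compressed bytes:", len(payload))
```

Sending one compressed payload per batch instead of one record per event is what cuts both the request count and the bytes transferred, since events in a batch share most of their field names and values.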
10
Article
KDnuggets·2y
Project Ideas to Master Data Engineering
To effectively learn data engineering, working on projects is essential. Key skills to focus on include data transformation, data visualization, building data pipelines, and implementing data storage solutions like data lakes and data warehouses. The post suggests six project ideas to cover these aspects: building an end-to-end data pipeline, transforming data sets, implementing a data lake, creating a data warehouse, processing real-time data, and visualizing data with dashboards.
68
11
Video
YouTube·1y
Data Science Full Course - Complete Data Science Course | Data Science Full Course For Beginners IBM
Data science is a rapidly growing field with significant career opportunities due to the massive amounts of data produced and advancements in computing power and artificial intelligence. The course from IBM introduces key concepts and skills necessary for starting a career in data science, including big data, artificial intelligence, and cloud computing. It provides instructional videos, readings, practice assessments, and insights from data science professionals, concluding with a case study and a final peer-reviewed project.
67
12
Article
ByteByteGo·2y
Trillions of Indexes: How Uber’s LedgerStore Supports Such Massive Scale
Uber's LedgerStore is a custom-built solution to manage trillions of financial transaction records efficiently. It ensures data immutability and supports various types of indexes including strongly consistent, eventually consistent, and time-range indexes. The migration from DynamoDB to LedgerStore for Uber's payment data was driven by the need for cost savings, simplified architecture, improved performance, and tailored features for financial data management. This transition involved handling 1.2 PB of compressed data with zero data inconsistencies detected over six months.
55
3
13
Article
KDnuggets·2y
Tools Every AI Engineer Should Know: A Practical Guide
Being an AI engineer requires expertise in various tools and skills such as Python, R, big data frameworks like Hadoop and Spark, and cloud services like AWS, GCP, and Microsoft Azure. These tools are essential for building and optimizing AI systems. An AI engineer must also have solid programming knowledge, a deep understanding of machine learning, and practical experience through data projects, competitions, and open-source contributions.
54
1
14
Article
Data Engineer Things·2y
I spent 6 hours learning Apache Arrow: Overview
Apache Arrow is a standard memory format designed for efficient data processing in analytics workloads. It focuses on performance and interoperability by leveraging a columnar in-memory format and aligned memory allocation. Arrow minimizes serialization and deserialization costs, enabling efficient data sharing between systems. Key elements include physical memory layouts for arrays, record batch serialization, and IPC formats enabling seamless inter-process and network data transfers. Arrow is widely adopted by various data projects, enhancing their performance and data handling capabilities.
52
2
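The columnar in-memory layout described in the Arrow summary above can be approximated with Python's built-in `array` module. This is not the Arrow format or the `pyarrow` API, only a sketch of the idea that each column lives in one contiguous typed buffer; the sample rows and column names are hypothetical.

```python
from array import array

# Row-oriented records: each tuple mixes values of different columns.
rows = [(1, 9.5), (2, 12.0), (3, 7.25)]

# Column-oriented layout: one contiguous, typed buffer per column.
columns = {
    "id": array("q", (row[0] for row in rows)),  # signed 64-bit integers
    "price": array("d", (row[1] for row in rows)),  # 64-bit floats
}

# An aggregate over one column touches only that column's buffer.
total_price = sum(columns["price"])

# The raw bytes can be handed to another consumer and reinterpreted as-is,
# without converting values one by one.
price_bytes = columns["price"].tobytes()
restored = array("d")
restored.frombytes(price_bytes)

print(total_price)
print(len(price_bytes), list(restored))
```

Sharing the buffer rather than re-encoding each value is the property that lets systems exchange columnar data with little serialization cost.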
15
Article
CrateDB·2y
Real-Time Data Indexing: Index Everything, Query Anything, Real-time
Relational databases have evolved significantly since their inception between 1976 and 1979, notably with the introduction of query optimization using indexes. The 2010s saw the rise of schemaless databases, allowing developers to manage data without predefined schemas. CrateDB extends this concept by indexing every column by default using Lucene, which provides high query efficiency at the cost of increased storage requirements. The post argues that this approach simplifies and accelerates database management, potentially rendering traditional database optimization roles obsolete.
49
16
Article
ByteByteGo·2y
How Uber Manages Petabytes of Real-Time Data
Uber's real-time data infrastructure processes petabytes of data daily, supporting features like customer incentives and fraud detection. The system relies on Apache Kafka for streaming data, Apache Flink for stream processing, and Apache Pinot for real-time OLAP. Key requirements include consistency, availability, data freshness, scalability, and cost efficiency. Customizations and tools like FlinkSQL and uReplicator enhance reliability and performance. This enables real-time decisions such as dynamic pricing and operational insights. Scalability strategies, including Active-Active and Active-Passive Kafka setups, ensure high availability and fault tolerance.
46
17
Article
KDnuggets·2y
5 Free Online Courses to Learn Data Engineering Fundamentals
Explore five free online courses designed to teach the fundamentals of data engineering. These courses range from beginner-friendly introductions to comprehensive professional certificates. Key areas covered include data pipelines, databases, Python and Pandas, cloud computing, and big data tools like Hadoop and Apache Spark.
46
1
18
Article
Data Engineer Things·1y
ETL and ELT
The author reflects on their journey from chasing the latest data engineering tools to focusing on foundational concepts, emphasizing the shift from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform). The traditional ETL process, necessitated by the high costs and limitations of early data warehouses, is contrasted with the modern ELT approach, facilitated by advancements in cloud data warehousing. ELT offers greater flexibility and efficiency by loading raw data into the warehouse and handling transformations within the warehouse, aligning better with agile development practices.
45
4
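The ELT pattern from the summary above, loading raw data first and transforming it inside the warehouse, can be sketched with an in-memory SQLite database standing in for a cloud warehouse. The table names, columns, and sample records are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")  # stand-in for a cloud data warehouse

# Extract + Load: raw records land in the warehouse untouched, as text.
conn.execute("CREATE TABLE raw_orders (order_id TEXT, amount TEXT, status TEXT)")
conn.executemany(
    "INSERT INTO raw_orders VALUES (?, ?, ?)",
    [("1", "19.90", "paid"), ("2", "5.00", "CANCELLED"), ("3", "40.10", "PAID")],
)

# Transform: typing and cleaning happen afterwards, inside the warehouse, in SQL.
conn.execute(
    """
    CREATE TABLE orders_clean AS
    SELECT CAST(order_id AS INTEGER) AS order_id,
           CAST(amount AS REAL)      AS amount,
           LOWER(status)             AS status
    FROM raw_orders
    """
)

paid_total = conn.execute(
    "SELECT ROUND(SUM(amount), 2) FROM orders_clean WHERE status = 'paid'"
).fetchone()[0]
print(paid_total)
```

Because `raw_orders` is kept as loaded, a changed transformation can be rerun against the original data without extracting it again, which is the flexibility the summary attributes to ELT.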
19
Article
Community Picks·2y
How SQL Enhances Your Data Science Skills
SQL is vital for data scientists due to its ability to efficiently retrieve, manipulate, and analyze large datasets. Key SQL concepts such as SELECT statements, WHERE clauses, JOIN operations, and aggregate functions enhance data exploration, preparation, and integration. Mastering these SQL skills complements other data science tools and improves overall data handling capabilities.
45
1
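The SQL concepts listed in the summary above (SELECT, WHERE, JOIN, and aggregate functions) fit into a single query. The sketch below runs one against an in-memory SQLite database; the `customers` and `orders` tables and their contents are hypothetical.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'EU'), (2, 'EU'), (3, 'US');
    INSERT INTO orders VALUES
        (1, 1, 10.0), (2, 1, 30.0), (3, 2, 5.0), (4, 3, 100.0), (5, 3, 2.0);
    """
)

rows = conn.execute(
    """
    SELECT c.region, COUNT(*) AS n_orders, SUM(o.amount) AS revenue  -- aggregates
    FROM orders AS o
    JOIN customers AS c ON c.id = o.customer_id                      -- JOIN
    WHERE o.amount >= 5.0                                            -- WHERE
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()

for region, n_orders, revenue in rows:
    print(region, n_orders, revenue)
```

The query filters out small orders, attaches each remaining order to its customer's region, and summarizes order count and revenue per region.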
20
Article
Data Engineer Things·1y
Apache Flink Overview
Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. It excels in real-time processing with a model centered on streams, using components such as Dispatcher, JobManager, ResourceManager, and TaskManager. Flink differentiates between event-time and processing-time semantics to manage complexities in data flows. It also offers robust state management and checkpointing to ensure fault tolerance. Its architecture supports scalable, high-throughput, and low-latency processing environments, making it suitable for applications involving complex event data.
42
2
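The difference between event time and processing time in the Flink summary above shows up as soon as events arrive out of order. The sketch below is plain Python, not the Flink API, and it leaves out watermarks and state checkpointing; the window size, the `tumbling_counts` helper, and the sample events are hypothetical.

```python
from collections import defaultdict

WINDOW_SECONDS = 60  # hypothetical tumbling window size


def tumbling_counts(events):
    """Count events per tumbling window, keyed by each event's own timestamp."""
    counts = defaultdict(int)
    for event_time, _payload in events:
        window_start = event_time - event_time % WINDOW_SECONDS
        counts[window_start] += 1
    return dict(sorted(counts.items()))


# (event_time_seconds, payload), listed in the order the events ARRIVED.
# "c" happened at t=59 but arrived after "b", which happened at t=61.
events = [(5, "a"), (61, "b"), (59, "c"), (130, "d"), (62, "e")]

counts = tumbling_counts(events)
print(counts)  # {0: 2, 60: 2, 120: 1}
```

Grouping by event time puts the late event "c" into the window starting at 0, where it occurred; grouping by arrival order would have counted it alongside "b" and "e".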
21
Article
Baeldung·1y
Introduction to Apache Accumulo
Apache Accumulo is a powerful, distributed key-value store designed for handling massive datasets with fine-grained security. Developed originally by the NSA and based on Google's Bigtable, it excels in scalability, performance, and security, enabling efficient data ingestion, retrieval, and processing. Accumulo supports cell-level security, server-side programming, and flexible data models, making it ideal for applications requiring strict access controls and large-scale data management.
41
1
22
Article
Hacker News·2y
Building Databases over a Weekend
Databases are ubiquitous yet often viewed as complex systems, typically developed by specialized experts. Despite their complexity, innovation in database technology continues, with tools like Apache DataFusion simplifying the process for developers. DataFusion allows developers to build custom databases by extending or replacing various layers, particularly useful for creating bespoke query engines. This guide demonstrates how to implement a window operator for stream processing applications using DataFusion, detailing the integration into the physical and logical planning stages and optimizing the custom operator.
41
1
23
Article
Quastor Daily·2y
The Architecture of Grab's Data Lake
This post discusses the architecture of Grab's Data Lake, including the design choices for data storage formats, the use of Merge on Read and Copy on Write strategies, and the importance of efficient data storage for data analysis and insights.
40
24
Article
Community Picks·1y
dask/dask: Parallel computing with task scheduling
Dask is a flexible parallel computing library designed for analytics. It enables efficient task scheduling and is licensed under the New BSD License.
36
1
25
Article
Data Engineer Things·1y
I spent 4 hours learning how Netflix operates Apache Iceberg at scale.
Netflix has developed a sophisticated data platform to handle extensive data pipelines and analytics, using Apache Iceberg to overcome the limitations of their previous Hive-based system. Key components include Polaris, a custom metastore for Iceberg, and Janitors, a cleanup service. They also implemented Autotune for optimizing data layout and Autolift for localizing data files. Moreover, secure access controls were established for Iceberg tables. Netflix's migration tool for transitioning from Hive to Iceberg minimizes data movement and business interruptions.
36
