Best of Big Data — November 2024

1
Article
Data Engineer Things·2y
I spent 3 hours learning how Uber manages data quality.
Uber runs a comprehensive data quality platform that automatically detects and manages quality issues to maintain high data standards across more than 2,000 datasets. Its components include a Test Execution Engine, a Test Generator, and an Alert Generator. The platform automates tasks such as generating tests and alerts and rerunning failed tests to verify incidents. Uber also integrates its data quality tools with other platforms to provide a seamless experience for its internal teams.
131
2
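The "rerun failed tests to verify incidents" idea in the summary above can be sketched in a few lines. This is a minimal illustration, not Uber's implementation: `confirm_incident`, `run_test`, and the `reruns` parameter are hypothetical names, and the rule "alert only if every rerun also fails" is an assumption.

```python
def confirm_incident(run_test, reruns=1):
    """Return True only if the test fails at first and on every rerun.

    `run_test` is a hypothetical zero-argument callable that returns True
    when the data quality test passes and False when it fails.
    """
    if run_test():
        return False  # first run passed, nothing to verify
    # Rerun the failed test; a single passing rerun clears the incident.
    return not any(run_test() for _ in range(reruns))


# Usage: a transient failure (fail, then pass) does not raise an incident.
outcomes = iter([False, True])
print(confirm_incident(lambda: next(outcomes)))
```

Requiring the failure to repeat trades a slightly later alert for fewer false alarms caused by transient conditions such as late-arriving data.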
2
Article
ByteByteGo·2y
How McDonald Sells Millions of Burgers Per Day With Event-Driven Architecture
McDonald's has developed a unified, event-driven platform to handle its global operations efficiently. The platform is designed for scalability, high availability, performance, security, reliability, consistency, and simplicity. Core components include Amazon Managed Streaming for Apache Kafka (MSK), a schema registry, a standby event store, custom SDKs, and an event gateway. The system ensures data integrity and efficient processing through schema validation and robust error-handling mechanisms. Key techniques include data governance, cluster autoscaling, and domain-based sharding. Planned enhancements include formal event specification, a transition to serverless MSK, and improved developer tooling.
75
2
3
Article
CrateDB·2y
Real-Time Data Indexing: Index Everything, Query Anything, Real-time
Relational databases have evolved significantly since their inception between 1976 and 1979, notably with the introduction of query optimization using indexes. The 2010s saw the rise of schemaless databases, allowing developers to manage data without predefined schemas. CrateDB enhances this concept by indexing every column by default using Lucene, providing high query efficiency but with increased storage requirements. This revolutionary approach simplifies and accelerates database management, potentially rendering traditional database optimization roles obsolete.
49
4
Article
Hacker News·2y
Building Databases over a Weekend
Databases are ubiquitous yet often viewed as complex systems, typically developed by specialized experts. Despite their complexity, innovation in database technology continues, with tools like Apache DataFusion simplifying the process for developers. DataFusion allows developers to build custom databases by extending or replacing various layers, particularly useful for creating bespoke query engines. This guide demonstrates how to implement a window operator for stream processing applications using DataFusion, detailing the integration into the physical and logical planning stages and optimizing the custom operator.
41
1
5
Article
Data Engineer Things·1y
I spent 4 hours learning how Netflix operates Apache Iceberg at scale.
Netflix has developed a sophisticated data platform to handle extensive data pipelines and analytics, using Apache Iceberg to overcome the limitations of their previous Hive-based system. Key components include Polaris, a custom metastore for Iceberg, and Janitors, a cleanup service. They also implemented Autotune for optimizing data layout and Autolift for localizing data files. Moreover, secure access controls were established for Iceberg tables. Netflix's migration tool for transitioning from Hive to Iceberg minimizes data movement and business interruptions.
36
6
Video
Fireship·2y
Apache Spark in 100 Seconds
Apache Spark is an open-source data analytics engine designed to process massive streams of data from multiple sources at high speed by performing most tasks in memory. Created in 2009 at UC Berkeley, it is widely used in various fields, including e-commerce and space research. It supports multiple languages through APIs and can be run locally or scaled across distributed systems. Spark also has robust machine learning capabilities with its MLlib library.
29
7
Article
Data Engineer Things·1y
How does Netflix ensure the data quality for thousands of Apache Iceberg tables?
Netflix employs the Write-Audit-Publish (WAP) pattern using Apache Iceberg to maintain high data quality across thousands of tables. The WAP pattern involves writing data to a hidden snapshot, auditing it, and publishing it only if it passes quality checks. This approach is analogous to CI/CD workflows, ensuring validated data is exposed to downstream consumers. Apache Iceberg's structure, including manifest files, metadata files, and catalog, supports efficient snapshot management and branching, similar to version control in Git.
23
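The Write-Audit-Publish flow described in the summary above can be sketched with a toy in-memory table. This is an illustration under stated assumptions, not the Apache Iceberg API or Netflix's code: `ToyTable`, `audit`, and `write_audit_publish` are hypothetical names, and the audit rule (non-empty, no missing `user_id`) is invented for the example.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class ToyTable:
    """In-memory stand-in for a table with snapshots (not the Iceberg API)."""
    snapshots: dict = field(default_factory=dict)  # snapshot id -> rows
    current_id: Optional[int] = None               # snapshot readers can see
    next_id: int = 0

    def write_staged(self, rows):
        """Write: store a new snapshot that readers cannot see yet."""
        snapshot_id = self.next_id
        self.next_id += 1
        self.snapshots[snapshot_id] = list(rows)
        return snapshot_id

    def publish(self, snapshot_id):
        """Publish: make the staged snapshot the current one."""
        self.current_id = snapshot_id

    def read(self):
        """Readers only ever see the published snapshot."""
        if self.current_id is None:
            return []
        return self.snapshots[self.current_id]


def audit(rows):
    """Audit: hypothetical checks (non-empty, no missing user_id)."""
    return bool(rows) and all(r.get("user_id") is not None for r in rows)


def write_audit_publish(table, rows):
    """Stage the rows, audit them, and publish only if the audit passes."""
    snapshot_id = table.write_staged(rows)
    if audit(table.snapshots[snapshot_id]):
        table.publish(snapshot_id)
        return True
    return False


table = ToyTable()
write_audit_publish(table, [{"user_id": 1}])     # passes the audit
write_audit_publish(table, [{"user_id": None}])  # fails, stays hidden
print(table.read())
```

The failed batch remains stored as an unpublished snapshot, so it can be inspected without ever being visible to downstream readers.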
8
Article
Data Engineer Things·1y
I spent 8 hours relearning the Delta Lake table format
Delta Lake is an ACID table storage layer built on cloud object storage, designed to address the challenges of using cloud storage for data lakes. It uses a transaction log to maintain data consistency and supports features like time travel, updates, and deletes. Concurrency is managed through optimistic concurrency control, and data mutation strategies include both copy-on-write and merge-on-read. The system is optimized for read/write performance and supports various data management features like layout optimizations and audit logging.
17
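Optimistic concurrency control over a transaction log, as mentioned in the summary above, might look roughly like the sketch below. It is a toy model, not the Delta Lake API: `ToyLog` and `commit_with_retry` are hypothetical names, and the sketch treats any concurrent commit as a conflict, which is stricter than a real system that checks whether the changes actually overlap.

```python
class ToyLog:
    """In-memory stand-in for an ordered transaction log (not Delta Lake)."""

    def __init__(self):
        self.entries = []  # entry i is the commit that produced version i

    def latest_version(self):
        # -1 means the table has no commits yet.
        return len(self.entries) - 1

    def try_commit(self, read_version, actions):
        """Append only if nobody committed since the writer read the table."""
        if read_version != self.latest_version():
            return False  # conflict: the caller must re-read and retry
        self.entries.append(actions)
        return True


def commit_with_retry(log, make_actions, max_attempts=3):
    """Re-read the latest version and retry when a commit conflicts."""
    for _ in range(max_attempts):
        read_version = log.latest_version()
        if log.try_commit(read_version, make_actions(read_version)):
            return log.latest_version()
    raise RuntimeError("gave up after repeated conflicts")


log = ToyLog()
log.try_commit(-1, ["add file-a"])         # first writer wins
print(log.try_commit(-1, ["add file-b"]))  # stale read version is rejected
print(commit_with_retry(log, lambda version: ["add file-b"]))
```

Because every version is kept as a log entry, reading the table "as of" an older version only requires replaying a prefix of the log, which is the intuition behind time travel.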
9
Article
asayer·1y
Data Lake vs Data Warehouse: Key Differences and When to Use Each
Data lakes and data warehouses are two primary storage solutions for big data. Data lakes store raw and diverse data types, making them ideal for machine learning and extensive data analytics. Data warehouses store structured data for quick analysis and reporting, suitable for business intelligence and real-time insights. A data lakehouse combines features of both, providing flexibility and high-speed performance for a variety of data storage needs.
15
10
Article
agoda·2y
A Day in the Life of a Data Engineer at Agoda
At Agoda, data fuels every decision and data engineers play a key role by designing and maintaining data pipelines. Lookuut Struchkov, a Staff Data Engineer, discusses his journey and daily responsibilities, including optimizing data pipelines, collaborating with various teams, and handling on-call support requests. He emphasizes the importance of continuous learning and shares advice for aspiring data engineers. Agoda stands out for its commitment to innovation, use of cutting-edge technologies like Spark and Scala, and its collaborative culture.
14
1
11
Article
Data Engineer Things·2y
I spent 4 hours learning Apache Spark Resource Allocation
This article gives an overview of Apache Spark's resource allocation mechanisms and scheduling modes. It covers static and dynamic resource allocation, highlighting how dynamic allocation uses heuristics for acquiring and removing executors. It also compares FIFO and fair scheduling, explaining how the latter shares resources roughly equally among jobs. It closes with considerations for gracefully decommissioning executors and for using an external shuffle service.
13
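The settings discussed in the summary above map onto a handful of Spark configuration properties. The sketch below collects them in a plain dictionary and renders them as `spark-submit --conf` arguments. The property names are Spark's own; the values are illustrative assumptions rather than recommendations, and `to_submit_args` is a hypothetical helper.

```python
# Illustrative values only; tune them for the actual cluster and workload.
spark_conf = {
    # Fair scheduling between jobs inside one application (default is FIFO).
    "spark.scheduler.mode": "FAIR",
    # Dynamic allocation: grow and shrink the executor pool at runtime.
    "spark.dynamicAllocation.enabled": "true",
    "spark.dynamicAllocation.minExecutors": "2",
    "spark.dynamicAllocation.maxExecutors": "20",
    # Request executors once tasks have been pending for this long.
    "spark.dynamicAllocation.schedulerBacklogTimeout": "1s",
    # Release executors that have been idle for this long.
    "spark.dynamicAllocation.executorIdleTimeout": "60s",
    # External shuffle service keeps shuffle files readable after removal.
    "spark.shuffle.service.enabled": "true",
}


def to_submit_args(conf):
    """Render a config dict as spark-submit `--conf key=value` arguments."""
    args = []
    for key, value in sorted(conf.items()):
        args += ["--conf", f"{key}={value}"]
    return args


print(" ".join(to_submit_args(spark_conf)))
```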
12
Article
Data Engineer Things·2y
Understanding Data Products and Data Contracts
Data products and data contracts are essential tools for transforming raw data into valuable assets. Data products are curated datasets crafted to solve specific business problems, while data contracts are formal agreements ensuring data quality and reliability between producers and consumers. These concepts help organizations manage data efficiently, foster trust, and drive innovation by defining clear standards and processes for data handling and access control.
12
13
Article
databricks·2y
From Data Warehousing to Data Intelligence: How Data Took Over
Organizations are moving into an era of data intelligence, using AI to understand and leverage enterprise data. This evolution was fueled by advancements like data lakehouse architecture, Apache Spark, Delta Lake, and MLflow. Over the past decade, these technologies helped break down data silos, streamline data management, and enable advanced analytics and AI. GenAI now drives this transformation further, allowing businesses to customize AI systems for their unique needs, improving both efficiency and governance in data handling.
11
14
Article
Last9·2y
The Parquet Files: A Surprisingly Entertaining Guide to Columnar Storage
Parquet files offer a more efficient approach to storing and querying large datasets compared to CSV files. Key benefits include significant file size reduction due to column-level compression, improved query performance through selective column access, and schema evolution support. The post covers best practices such as avoiding over-partitioning and choosing appropriate compression methods, ultimately highlighting the cost and performance advantages of using Parquet in big data analytics and cloud environments.
11
15
Article
Metadata·2y
DDIA: Chp 10. Batch Processing
Batch processing allows large-scale data transformations, and Google's MapReduce framework simplified parallel processing by abstracting network communication and failure handling. While Hadoop MapReduce leverages HDFS for distributed storage, newer dataflow engines like Spark and Flink address some limitations of MapReduce by offering more flexible operator connections and more efficient use of computational resources.
10
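The MapReduce programming model mentioned in the summary above can be illustrated with a single-process word count. This is only a sketch of the map, shuffle, and reduce steps; a real framework runs them across many machines and handles network communication and failures, which this example leaves out entirely. All function names are hypothetical.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one document."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups


def reduce_phase(key, values):
    """Reduce: combine the values for one key into a single result."""
    return key, sum(values)


def word_count(documents):
    """Run map, shuffle, and reduce in sequence over all documents."""
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


print(word_count(["big data", "Big batch"]))
```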