Best of Data Architecture — 2024

1
Article
Community Picks·2y
How Google Search Works 🔥
Google Search started as a university project and grew into a groundbreaking search engine with the help of venture capital funding. Its architecture consists of three primary stages: crawling, indexing, and searching. Crawlers are distributed globally to scan and download web content, which is then indexed to make it searchable. The search stage cleans queries, applies language models, and ranks results efficiently. Each component is designed for scalability and performance, laying the foundation of the modern Internet.
32
2
Article
Taro·2y
The Fundamentals of Data Engineering - Preface + Chapter 1: Data Engineering Described
This post provides an overview of the book 'Fundamentals of Data Engineering', discussing the motivations behind the book and the importance of data engineering in relation to data science and machine learning. It covers the Data Science Hierarchy of Needs, the Data Engineering Lifecycle, and the skills and activities of a Data Engineer. The post also discusses the stages of Data Maturity and the different types of Data Engineers. The next blog post will cover Chapter 2 of the book.
31
1
3
Article
Substack·2y
How to pass data engineer interviews in 2024
The post outlines strategies for passing data engineering interviews in 2024, emphasizing the key interview types: SQL, data structures and algorithms, behavioral, data modeling, and data architecture. It provides detailed tips, such as coding efficiently in SQL, preparing for algorithm questions, and using the STAR method for behavioral interviews. Essential concepts in data modeling and architecture, including trade-offs and different architecture types, are also discussed. The author highlights the importance of clear communication, optimizing queries, and building good rapport with the interviewer.
30
1
4
Article
Data Engineer Things·1y
The Data Lake, Warehouse and Lakehouse
The post explores the evolution of data architecture, beginning with traditional data warehouses, followed by the introduction of data lakes, and culminating in the emergence of the Lakehouse paradigm. It highlights the limitations of data warehouses and data lakes, such as challenges with unstructured data and data staleness. The Lakehouse architecture aims to combine the best features of both by utilizing low-cost storage and enhancing management features such as ACID transactions and query optimization. The post also mentions various technologies like Delta Lake, Apache Hudi, and Apache Iceberg that facilitate efficient data management in Lakehouse architectures.
29
5
Article
Netflix TechBlog·2y
Introducing Netflix TimeSeries Data Abstraction Layer
Netflix has introduced a TimeSeries Data Abstraction Layer designed to handle vast amounts of temporal data with millisecond access latency. Key features include efficient data partitioning, flexible storage integration (using Apache Cassandra and Elasticsearch), and scalability to manage high-throughput, immutable temporal event data. This abstraction layer optimizes storage and query efficiency, addressing issues like global read/write operations, tunable configurations, bursty traffic management, and cost efficiency. It plays a vital role in various Netflix services like user interaction tracking, feature rollout analysis, and asset impression tracking.
29
6
Article
Data Engineering·1y
Medallion Architecture Hype or Useful?
Medallion Architecture is a term coined by Databricks that aims to simplify data architecture for business and domain experts. However, it may be confusing for data professionals who are accustomed to classical data architecture models such as stage, cleansing, core, and mart, where marts are typically persisted in cubes for faster responses.
27
2
7
Video
Community Picks·2y
The Problem With UUIDs
UUIDs are designed to be universally unique but can negatively impact database performance when used as primary keys, especially in MySQL. The video discusses the various versions of UUIDs and their specific use cases, highlighting potential issues such as insert performance and storage utilization. It also explores alternatives like ordered UUIDs, nano IDs, and snowflake IDs to mitigate these concerns and improve efficiency in distributed systems.
27
4
8
Article
Redis·2y
Using Redis for real-time RAG goes beyond a Vector Database
The post discusses the importance of real-time access to data in GenAI applications and introduces Redis as a solution for real-time RAG. It explains Redis' vector search capabilities, semantic caching, and LLM Memory, and how they contribute to faster response times and improved user experiences. The post also provides benchmark results comparing real-time and non-real-time RAG architectures.
25
9
Article
Collections·1y
How Airbnb Processes a Million User Events Every Second
Airbnb's User Signals Platform processes over a million user events per second using the Lambda Architecture, combining real-time processing with historical data accuracy. Apache Flink, a stream-processing framework, is pivotal in achieving low latency, fault tolerance, and seamless integration, allowing Airbnb to enhance its recommendation system and drive revenue growth.
17
10
Article
Martin Fowler·1y
Designing data products
The post discusses a methodical approach to designing data products by working backwards from use cases. It outlines the characteristics of effective data products and differentiates them from data-driven applications. The approach focuses on avoiding overdesign and ensuring data products are discoverable, addressable, understandable, trustworthy, natively accessible, interoperable, valuable on their own, and secure. A real-world example in fashion retail is provided to illustrate the process.
15
11
Article
Data Engineer Things·2y
Rethinking Data Layers: When Medallion Architecture Isn’t Enough
Medallion Architecture's three-layer model (bronze, silver, gold) often falls short for large-scale businesses. Challenges like hidden crucial datasets and irregular updates necessitate more nuanced layers. Key considerations include granular pipeline tracking, regulatory compliance, data science needs, optimized reporting, data quality checks, and schema validation. Adapt layers based on organizational requirements, employing techniques such as raw data storage, schema validation, and data masking for better data integrity and security.
14
12
Video
The Serious CTO·2y
Data Mesh: The Future of Data Engineering Explained
Data Mesh redefines data architecture by decentralizing data management. Instead of centralizing all data in one big system, each department manages its own data, ensuring it's clean and accessible. This approach aims to eliminate bottlenecks, improve data quality, and foster better collaboration with shared standards across the company.
13
13
Article
ploeh blog·2y
Three data architectures for the server
Choosing the right software architecture is crucial for meeting future requirements, although predicting the future is not feasible. This post explores three data architectures—Ports and Adapters, Shared Data Model, and Domain Model only—to raise awareness of the alternatives available for persisting data in applications. The examples focus on persisting restaurant table configurations and highlight different data formats like JSON and XML. Readers are encouraged to understand the benefits and drawbacks of each architecture to make informed decisions.
12
