Best of Big Data
September 2024

  1.
    Article
    System Design Codex · 2y

    Introduction to Kafka

    Kafka is a distributed event store and streaming platform initially developed by LinkedIn and now widely used by companies like Netflix and Uber for data pipelines. It is favored for its reliability and scalability. Messages are written in batches for efficiency and organized into topics, which are split into partitions. Producers send messages to Kafka brokers, and consumers read them. Brokers usually run as a cluster, which allows messages to be replicated for redundancy. Despite these benefits, Kafka has its complexities, including a large number of configuration options and underdeveloped client libraries outside Java and C.
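
As a rough illustration of how batching and partitioning fit together, here is a minimal in-memory sketch. It does not use any Kafka client library: `ToyProducer`, `partition_for`, and the CRC32 hash are all hypothetical stand-ins (Kafka's default partitioner hashes keys with murmur2), chosen only to show that messages with the same key land in the same partition and are appended in batches.

```python
import zlib
from collections import defaultdict


def partition_for(key: str, num_partitions: int) -> int:
    # Stand-in hash for illustration; Kafka's default partitioner uses murmur2.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


class ToyProducer:
    """Buffers messages per partition and appends them to a log in batches."""

    def __init__(self, num_partitions: int, batch_size: int):
        self.num_partitions = num_partitions
        self.batch_size = batch_size
        self.pending = defaultdict(list)  # partition -> buffered messages
        self.log = defaultdict(list)      # partition -> list of written batches

    def send(self, key: str, value: str) -> int:
        partition = partition_for(key, self.num_partitions)
        self.pending[partition].append((key, value))
        # Write the batch once enough messages have accumulated.
        if len(self.pending[partition]) >= self.batch_size:
            self.flush(partition)
        return partition

    def flush(self, partition=None) -> None:
        # Flush one partition, or every partition with buffered messages.
        targets = [partition] if partition is not None else list(self.pending)
        for p in targets:
            if self.pending[p]:
                self.log[p].append(list(self.pending[p]))
                self.pending[p].clear()
```

Keying by something like a user ID keeps that user's events in order within one partition, which is the usual reason to pick a key at all.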

  2.
    Article
    Hacker News · 2y

    Data Engineering Vault

    The Data Engineering Vault is a comprehensive resource designed to help you explore and discover interconnected terms in data engineering. It covers the definition and evolution of data engineering, highlighting the importance of tools like Python, Apache Airflow, and SQL. Additionally, it offers resources for getting started with data engineering, including must-read articles, influential books, and valuable community insights.

  3.
    Article
    Towards Data Science · 2y

    The “Who Does What” Guide To Enterprise Data Quality

    Effective data quality management in large organizations involves clearly defined roles and responsibilities across foundational and derived data products. Foundational products, managed by a central team, serve multiple use cases, while derived products are tailored for specific needs and owned by domain-specific teams. Key practices include end-to-end monitoring, business rule application, and efficient triage processes. Building trust through communication and data health measurement is also crucial.

  4.
    Article
    Data Engineer Things · 2y

    I spent 8 hours diving deep into Snowflake (again)

    This 2024 post revisits Snowflake, a prominent cloud data warehouse, to re-examine its architecture and internals. The platform, known for separating compute and storage, relies on cloud services like Amazon S3, Google Cloud Storage, and Azure Blob Storage for storage, and uses a shared-nothing engine for compute. Snowflake's system includes Virtual Warehouses, columnar storage, vectorized execution, and various caching mechanisms. It also uses FoundationDB for its data catalog management and employs runtime adaptivity in its query optimizer.

  5.
    Article
    Towards AI · 2y

    A Practical Approach to Using Web Data for AI and LLMs

    Businesses and researchers are increasingly relying on high-quality web data for AI and large language models (LLMs). Bright Data offers advanced tools to collect, manage, and use this data, making it easier to train models, improve real-time applications, and perform sentiment analysis. Their solutions ensure ethical data collection and compliance with privacy regulations, providing scalable infrastructure to handle various project needs. This is crucial as AI development demands not just vast amounts of data but also quality and relevance.

  6.
    Article
    Decube · 2y

    Top 10 Data Governance Tools for 2024

    Explore the top data governance tools for 2024 to enhance data security, compliance, and overall management. The global data governance software market is predicted to reach $11.8 billion in 2024, showcasing the increasing importance of effective data management. Key features to look for include data cataloging, lineage tracking, and data masking. Emerging trends focus on data privacy, AI-powered data management, and enterprise-wide governance strategies. Top tools such as Collibra, Informatica Axon, and Decube offer essential functionality for improving data quality and business decision-making.

  7.
    Video
    The Serious CTO · 2y

    Data Mesh: The Future of Data Engineering Explained

    Data Mesh redefines data architecture by decentralizing data management. Instead of centralizing all data in one big system, each department manages its own data, ensuring it's clean and accessible. This approach aims to eliminate bottlenecks, improve data quality, and foster better collaboration with shared standards across the company.

  8.
    Article
    SingleStore · 2y

    Designing a Real-Time Data Warehouse

    In the era of data-driven applications, real-time data warehouses (RTDWs) are crucial for enabling low-latency analytical queries on fresh data. Unlike traditional data warehouses, RTDWs support continuous data ingestion and high concurrency, making them essential for applications like fraud detection and market analysis that require immediate insights. SingleStore offers an RTDW solution with real-time data ingestion, low-latency processing, high-concurrency support, scalability, and integration with existing systems.

  9.
    Article
    Towards AI · 2y

    Journey From Data Warehouse To Lake To Lakehouse

    The post uses a fictional story to explain the data storage concepts of Data Warehouse, Data Lake, and Data Lakehouse. It traces the evolution from the structured storage of Data Warehouses, to the flexible, low-cost storage of Data Lakes, and finally to Data Lakehouses, which combine the benefits of both. Key concepts like schema-on-read and schema-on-write are explained, and top providers for each storage solution are recommended.
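
The difference between schema-on-write and schema-on-read can be sketched in a few lines. This is a toy model, not code from the post: `WarehouseTable`, `LakeStore`, `validate`, and `SCHEMA` are hypothetical names, and real systems enforce schemas far more richly. The point is only where validation happens: at insert time for the warehouse, at query time for the lake.

```python
import json

# Hypothetical schema for illustration: field name -> expected Python type.
SCHEMA = {"id": int, "amount": float}


def validate(record, schema):
    """Return the record if it matches the schema, else raise ValueError."""
    if not isinstance(record, dict):
        raise ValueError("record is not an object")
    for field, expected in schema.items():
        if field not in record:
            raise ValueError(f"missing field: {field}")
        if not isinstance(record[field], expected):
            raise ValueError(f"bad type for field: {field}")
    return record


class WarehouseTable:
    """Schema-on-write: records are checked before they are stored."""

    def __init__(self, schema):
        self.schema = schema
        self.rows = []

    def insert(self, record):
        self.rows.append(validate(record, self.schema))


class LakeStore:
    """Schema-on-read: raw text is stored as-is and interpreted when queried."""

    def __init__(self):
        self.raw = []

    def put(self, raw_line: str):
        self.raw.append(raw_line)

    def read(self, schema):
        rows = []
        for line in self.raw:
            try:
                rows.append(validate(json.loads(line), schema))
            except ValueError:  # also covers json.JSONDecodeError
                continue  # skip lines that do not fit this reader's schema
        return rows
```

The lake accepts anything cheaply and defers the cost of bad data to readers; the warehouse pays that cost up front and rejects the write.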

  10.
    Article
    Data Engineer Things · 2y

    I spent 7 hours diving deep into Apache Iceberg

    This post delves into the internals of the Apache Iceberg table format, covering its data and metadata layers, manifest files, and how it manages read and write operations. It includes details on compaction, hidden partitioning, sorting, and row-level updates with both copy-on-write and merge-on-read modes. The goal is to offer a comprehensive understanding of Iceberg's capabilities and optimizations for managing large datasets efficiently.
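
The trade-off between the two row-level update modes can be shown with a toy model. The sketch below is not the Iceberg API: `CopyOnWriteTable` and `MergeOnReadTable` are hypothetical classes that treat a data file as a list of rows, and the delete file here loosely resembles an equality delete (Iceberg also supports position deletes). Copy-on-write rewrites the affected data file at delete time; merge-on-read records the delete separately and applies it during the scan.

```python
class CopyOnWriteTable:
    """Deletes rewrite every data file that contains the deleted row."""

    def __init__(self, files):
        self.files = [list(f) for f in files]

    def delete(self, row_id):
        rewritten = []
        for data_file in self.files:
            if any(row["id"] == row_id for row in data_file):
                # Expensive write: produce a new file without the row.
                rewritten.append([r for r in data_file if r["id"] != row_id])
            else:
                rewritten.append(data_file)  # untouched files are reused
        self.files = rewritten

    def scan(self):
        # Cheap read: data files already reflect all deletes.
        return [row for data_file in self.files for row in data_file]


class MergeOnReadTable:
    """Deletes are written to small delete files and merged at read time."""

    def __init__(self, files):
        self.files = [list(f) for f in files]
        self.delete_files = []  # each entry is a set of deleted ids

    def delete(self, row_id):
        # Cheap write: data files stay as they are.
        self.delete_files.append({row_id})

    def scan(self):
        # More work on read: filter rows against all delete files.
        deleted = set().union(*self.delete_files)
        return [
            row
            for data_file in self.files
            for row in data_file
            if row["id"] not in deleted
        ]
```

Both tables return the same rows after a delete; they differ in whether the cost is paid by the writer or by every reader, which is why compaction matters for merge-on-read tables.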