Best of Data Warehouse: September 2024

  1.
    Article
    detlife, Data Engineer Things · 2y

    I spent 5 hours learning how Google manages terabytes of metadata for BigQuery.

    Google BigQuery treats metadata as being as crucial as the data itself and uses several techniques to manage massive amounts of it efficiently. BigQuery's architecture includes Colossus for storage, Dremel for query execution, and a dedicated shuffle service, all coordinated by Borg. Metadata is handled in a distributed manner using a columnar storage format called CMETA, which improves efficiency and performance. Physical query plans adapt dynamically based on real-time data to optimize results, and metadata scans are integrated into query processing.

  2.
    Article
    detlife, Data Engineer Things · 2y

    I spent 5 hours learning how ClickHouse built their internal data warehouse.

    ClickHouse built an internal data warehouse that handles 50 TB of data daily and incorporates multiple internal sources such as AWS, GCP, and Salesforce. It uses Airflow for scheduling, AWS S3 as the intermediate data layer, and Superset as the BI tool. Key features include data consistency, idempotency, and real-time analytics. The team has also adopted dbt to streamline data transformations and introduced new tools that improve user access to data.

  3.
    Article
    singlestore, SingleStore · 2y

    Designing a Real-Time Data Warehouse

    Real-time data warehouses (RTDWs) enable low-latency analytical queries on fresh data. Unlike traditional data warehouses, RTDWs support continuous data ingestion and high concurrency, which makes them essential for applications such as fraud detection and market analysis that require immediate insights. SingleStore offers an RTDW solution with real-time data ingestion, low-latency processing, high-concurrency support, scalability, and seamless integration.