Best of Big Data: October 2024

  1. Article
    Data Engineer Things · 2y

    I spent 8 hours learning the details of the Apache Spark scheduling process.

    The post walks through the Apache Spark scheduling process in detail. It covers the anatomy of a Spark job, including its stages and tasks, and the Directed Acyclic Graph (DAG) scheduler. It explains how SparkContext initiates scheduling, the roles of TaskScheduler and SchedulerBackend, and the concept of data locality in task execution. It also discusses speculative execution as a way to handle slow tasks, and closes with the end-to-end scheduling flow in Spark.

  2. Article
    ByteByteGo · 2y

    EP135: Big Data Pipeline Cheatsheet for AWS, Azure, and Google Cloud

    The post covers several topics relevant to engineering leaders, including big data pipelines on AWS, Azure, and Google Cloud. It provides a detailed cheatsheet of each platform's key services for data ingestion, storage, processing, and visualization. It also discusses API architectural styles and offers a concise guide to building secure APIs. The issue also includes a resource on key data structures used daily, plus an advertisement for an enterprise conference and a mini crash course on advanced AI tools.

  3. Article
    Data Engineer Things · 2y

    I spent 6 hours learning Apache Arrow: Overview

    Apache Arrow is a standard memory format designed for efficient data processing in analytics workloads. It focuses on performance and interoperability by leveraging a columnar in-memory format and aligned memory allocation. Arrow minimizes serialization and deserialization costs, enabling efficient data sharing between systems. Key elements include physical memory layouts for arrays, record batch serialization, and IPC formats enabling seamless inter-process and network data transfers. Arrow is widely adopted by various data projects, enhancing their performance and data handling capabilities.

  4. Article
    ByteByteGo · 2y

    How Uber Manages Petabytes of Real-Time Data

    Uber's real-time data infrastructure processes petabytes of data daily, supporting features like customer incentives and fraud detection. The system relies on Apache Kafka for streaming data, Apache Flink for stream processing, and Apache Pinot for real-time OLAP. Key requirements include consistency, availability, data freshness, scalability, and cost efficiency. Customizations and tools like FlinkSQL and uReplicator enhance reliability and performance. Together, these components enable real-time decisions such as dynamic pricing, as well as operational insights. Scalability strategies, including Active-Active and Active-Passive Kafka setups, ensure high availability and fault tolerance.

  5. Article
    Decube · 2y

    Understanding Data Products and Data Contracts: Building Trust in Modern Data Management

    Data products and data contracts transform raw data into reliable assets, helping organizations manage data quality and access control. Data products are curated and cleaned-up data sets designed to solve specific business problems. Data contracts are formal agreements that ensure data meets specified quality and update standards, fostering trust. Domain management organizes data by business function, enhancing order and security.

  6. Article
    Baeldung · 2y

    Introduction to Apache Hadoop

    The post introduces Apache Hadoop, an open-source framework for distributed storage and processing of large datasets. It explains Hadoop's core components: HDFS for storage, YARN for resource management, and MapReduce for data processing. The tutorial guides readers through setting up a Hadoop cluster on a GNU/Linux platform and performing basic operations such as file management and running MapReduce jobs. It also highlights several tools within the Hadoop ecosystem that support data ingestion, analysis, and extraction.

  7. Article
    BigData Boutique blog · 2y

    Elasticsearch Performance and Cost Efficiency on Elastic Cloud and On-Prem

    The post outlines strategies for optimizing Elasticsearch performance and cost efficiency on both Elastic Cloud and on-premises deployments. Key tactics include scaling up vs. scaling out, data tiering, continuous monitoring of critical metrics, efficient shard distribution, and advanced query optimization techniques. The strategies were presented in a recent webinar hosted by BigData Boutique and Elastic.

  8. Video
    ThePrimeTime · 2y

    The Real 100x Dev

    A software engineer manages a massive online chess platform with minimal resources. Utilizing a unique tech stack, including Scala and MongoDB, the platform supports millions of games daily. The engineer emphasizes simplicity in coding, minimizing dependencies, and avoiding tech debt. The project's successful operation with minimal staff showcases the effectiveness of deliberate choices in tech and design.