Best of Big Data
December 2024

  1.
    Video
    YouTube · 1y

    Data Science Full Course - Complete Data Science Course | Data Science Full Course For Beginners IBM

    Data science is a rapidly growing field with significant career opportunities due to the massive amounts of data produced and advancements in computing power and artificial intelligence. The course from IBM introduces key concepts and skills necessary for starting a career in data science, including big data, artificial intelligence, and cloud computing. It provides instructional videos, readings, practice assessments, and insights from data science professionals, concluding with a case study and a final peer-reviewed project.

  2.
    Article
    Data Engineer Things · 1y

    ETL and ELT

    The author reflects on their journey from chasing the latest data engineering tools to focusing on foundational concepts, emphasizing the shift from ETL (Extract, Transform, Load) to ELT (Extract, Load, Transform). The traditional ETL process, necessitated by the high costs and limitations of early data warehouses, is contrasted with the modern ELT approach, facilitated by advancements in cloud data warehousing. ELT offers greater flexibility and efficiency by loading raw data into the warehouse and handling transformations within the warehouse, aligning better with agile development practices.
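The load-first, transform-later order described above can be sketched with Python's built-in `sqlite3` module standing in for a cloud warehouse. This is only an illustration under stated assumptions: the table names (`raw_orders`, `orders_clean`), the sample rows, and the cleaning rules are hypothetical and do not come from the article.

```python
import sqlite3

# Hypothetical raw records as they might arrive from a source system.
RAW_ORDERS = [
    ("1001", " alice ", "19.99"),
    ("1002", "BOB", "5.00"),
    ("1003", "carol", "not-a-number"),  # bad row, still kept in raw storage
]


def load_raw(conn, rows):
    """Extract + Load: land the data untouched, every column stored as text."""
    conn.execute(
        "CREATE TABLE raw_orders (order_id TEXT, customer TEXT, amount TEXT)"
    )
    conn.executemany("INSERT INTO raw_orders VALUES (?, ?, ?)", rows)


def transform_in_warehouse(conn):
    """Transform: build a cleaned table with SQL that runs inside the store."""
    conn.execute(
        """
        CREATE TABLE orders_clean AS
        SELECT CAST(order_id AS INTEGER) AS order_id,
               LOWER(TRIM(customer))     AS customer,
               CAST(amount AS REAL)      AS amount
        FROM raw_orders
        WHERE amount GLOB '[0-9]*.[0-9]*'
        """
    )


def run_elt(rows=RAW_ORDERS):
    """Run load then transform; return the raw row count and the clean rows."""
    conn = sqlite3.connect(":memory:")
    load_raw(conn, rows)
    transform_in_warehouse(conn)
    raw_count = conn.execute("SELECT COUNT(*) FROM raw_orders").fetchone()[0]
    clean = conn.execute(
        "SELECT order_id, customer, amount FROM orders_clean ORDER BY order_id"
    ).fetchall()
    conn.close()
    return raw_count, clean
```

Because the raw table is kept as loaded, the cleaning rules can be changed and re-run later without extracting from the source again, which is the flexibility the summary attributes to ELT.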

  3.
    Article
    Data Engineer Things · 1y

    Apache Flink Overview

    Apache Flink is a framework and distributed processing engine for stateful computations over unbounded and bounded data streams. It excels in real-time processing with a model centered on streams, using components such as Dispatcher, JobManager, ResourceManager, and TaskManager. Flink differentiates between event-time and processing-time semantics to manage complexities in data flows. It also offers robust state management and checkpointing to ensure fault tolerance. Its architecture supports scalable, high-throughput, and low-latency processing environments, making it suitable for applications involving complex event data.
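The difference between event time and processing time can be shown without Flink at all. The sketch below is plain Python, not the Flink API; the 10-second tumbling window and the sample events are hypothetical. It counts the same events per window twice, once by when they happened and once by when they were processed.

```python
from collections import defaultdict

WINDOW_SECONDS = 10  # hypothetical tumbling-window size

# (event_time, processing_time, value); the third event arrives late.
EVENTS = [
    (1, 2, "a"),
    (12, 13, "b"),
    (8, 14, "c"),  # happened at t=8 but was only processed at t=14
    (15, 16, "d"),
]


def window_start(timestamp, size=WINDOW_SECONDS):
    """Return the start of the tumbling window that contains the timestamp."""
    return timestamp - (timestamp % size)


def count_by_window(events, use_event_time):
    """Count events per window using either event time or processing time."""
    counts = defaultdict(int)
    for event_time, processing_time, _value in events:
        ts = event_time if use_event_time else processing_time
        counts[window_start(ts)] += 1
    return dict(counts)
```

With these sample events, the late record lands in the first window under event time and in the second window under processing time, so the two groupings disagree. A real stream processor also has to decide how long to wait for late records before closing a window, which this sketch leaves out.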

  4.
    Article
    Baeldung · 1y

    Introduction to Apache Accumulo

    Apache Accumulo is a powerful, distributed key-value store designed for handling massive datasets with fine-grained security. Developed originally by the NSA and based on Google's Bigtable, it excels in scalability, performance, and security, enabling efficient data ingestion, retrieval, and processing. Accumulo supports cell-level security, server-side programming, and flexible data models, making it ideal for applications requiring strict access controls and large-scale data management.
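Cell-level security means each stored cell carries its own visibility label, and a scan returns only the cells whose label is satisfied by the caller's authorizations. The sketch below is plain Python, not the Accumulo client API. The `Cell` type, the sample data, and the evaluator are hypothetical, and the evaluator handles only flat expressions (labels joined by `&` or by `|`, no parentheses), which is a simplification.

```python
from typing import NamedTuple


class Cell(NamedTuple):
    row: str
    column: str
    visibility: str  # e.g. "", "doctor|nurse", or "doctor&billing"
    value: str


# Hypothetical sample data.
CELLS = [
    Cell("patient1", "name", "", "Ada"),
    Cell("patient1", "diagnosis", "doctor|nurse", "flu"),
    Cell("patient1", "invoice", "doctor&billing", "120.00"),
]


def is_visible(visibility, authorizations):
    """Evaluate a flat visibility expression against a set of auth labels."""
    if not visibility:
        return True  # an empty label places no restriction on the cell
    return any(
        all(label in authorizations for label in alternative.split("&"))
        for alternative in visibility.split("|")
    )


def scan(cells, authorizations):
    """Return only the cells the caller is authorized to see."""
    return [c for c in cells if is_visible(c.visibility, authorizations)]
```

Here a caller holding only `nurse` would see the name and diagnosis cells but not the invoice, while a caller holding both `doctor` and `billing` would see all three.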

  5.
    Article
    Community Picks · 1y

    dask/dask: Parallel computing with task scheduling

    Dask is a flexible parallel computing library designed for analytics. It enables efficient task scheduling and is licensed under the New BSD License.

  6.
    Article
    Data Engineer Things · 1y

    The Data Lake, Warehouse and Lakehouse

    The post explores the evolution of data architecture, beginning with traditional data warehouses, followed by the introduction of data lakes, and culminating in the emergence of the Lakehouse paradigm. It highlights the limitations of data warehouses and data lakes, such as challenges with unstructured data and data staleness. The Lakehouse architecture aims to combine the best features of both by utilizing low-cost storage and enhancing management features such as ACID transactions and query optimization. The post also mentions various technologies like Delta Lake, Apache Hudi, and Apache Iceberg that facilitate efficient data management in Lakehouse architectures.

  7.
    Article
    ByteByteGo · 1y

    How Statsig Streams 1 Trillion Events A Day

    Statsig processes over a trillion events daily for high-profile clients such as OpenAI and Atlassian, with a robust data pipeline designed for scalability and cost-efficiency. Key components include a reliable data ingestion layer, scalable message queues, and effective routing and integration techniques. Their strategy involves using Google Cloud Storage, Pub/Sub, spot nodes, and advanced compression methods to optimize performance and minimize costs, ensuring high reliability and low latency.

  8.
    Article
    Machine Learning Mastery · 1y

    Machine Learning vs. Traditional Analytics: When to Use Which?

    Understanding the differences between data analytics, data science, big data, and business intelligence is crucial. Data analytics focuses on predicting future patterns to support business decisions, while machine learning, a subfield of AI, builds models to perform tasks like classification and regression. Machine learning is best used for making predictions from complex datasets, whereas traditional analytics methods are suited for understanding historical data and identifying trends in smaller datasets.

  9.
    Article
    Daily Dose of Data Science | Avi Chawla | Substack · 1y

    Train Classical ML Models on Large Datasets

    Cohere announces Command R7B, a lightweight, fast, and enterprise-ready multilingual 7B-parameter model suitable for real-time chatbots and AI agents. The post also discusses methods for training classical ML models on large datasets, such as using big-data frameworks like Spark MLlib or the Random Patches approach. Random Patches, which trains tree-based models on sampled patches of the data, can outperform traditional random forests in some cases.
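As usually described, Random Patches gives each base model a random subset of the rows and a random subset of the features, so no single model has to see the full dataset. The sketch below shows only the sampling step in plain Python; the function names, fractions, and seed are hypothetical, and the tree training itself is left out.

```python
import random


def sample_patches(n_rows, n_features, n_patches, row_frac, feature_frac, seed=0):
    """Draw (row indices, feature indices) pairs, one patch per base model."""
    rng = random.Random(seed)
    rows_per_patch = max(1, int(n_rows * row_frac))
    features_per_patch = max(1, int(n_features * feature_frac))
    patches = []
    for _ in range(n_patches):
        # Sampling without replacement on both axes defines one patch.
        rows = sorted(rng.sample(range(n_rows), rows_per_patch))
        features = sorted(rng.sample(range(n_features), features_per_patch))
        patches.append((rows, features))
    return patches


def extract_patch(data, rows, features):
    """Slice a list-of-lists dataset down to the chosen rows and features."""
    return [[data[r][f] for f in features] for r in rows]
```

Each patch would then be handed to its own tree, and predictions would be combined across trees, for example by majority vote. Since every patch is a small slice, the patches can be built and trained on independently, which is what makes the idea attractive when the full dataset does not fit in memory.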

  10.
    Article
    Collections · 1y

    How Airbnb Processes a Million User Events Every Second

    Airbnb's User Signals Platform processes over a million user events per second using the Lambda Architecture, combining real-time processing with historical data accuracy. Apache Flink, a stream-processing framework, is pivotal in achieving low latency, fault tolerance, and seamless integration, allowing Airbnb to enhance their recommendation system and drive revenue growth.
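The Lambda Architecture mentioned above pairs a batch layer, which periodically recomputes accurate views over historical data, with a speed layer that covers events the last batch run has not yet seen. A query merges the two. The sketch below is a generic plain-Python illustration, not Airbnb's implementation; the event shape and function names are hypothetical.

```python
from collections import Counter


def build_view(events):
    """Count events per user; used for both the batch and the speed view."""
    return Counter(event["user"] for event in events)


def serve(user, batch_view, speed_view):
    """Answer a query by merging the batch view with the real-time view."""
    return batch_view.get(user, 0) + speed_view.get(user, 0)


# Hypothetical events: 'historical' was covered by the last batch run,
# 'recent' arrived afterwards and is only known to the speed layer.
historical = [{"user": "u1"}, {"user": "u1"}, {"user": "u2"}]
recent = [{"user": "u1"}, {"user": "u3"}]

batch = build_view(historical)
speed = build_view(recent)
```

When the next batch run absorbs the recent events, the speed view for that period can be discarded, so any approximation in the real-time path is eventually replaced by the batch result.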

  11.
    Article
    Martin Fowler · 1y

    Designing data products

    The post discusses a methodical approach to designing data products by working backwards from use cases. It outlines the characteristics of effective data products and differentiates them from data-driven applications. The approach focuses on avoiding overdesign and ensuring data products are discoverable, addressable, understandable, trustworthy, natively accessible, interoperable, valuable on their own, and secure. A real-world example in fashion retail is provided to illustrate the process.

  12.
    Article
    Data Engineer Things · 1y

    The Ultimate Guide to Zero ETL: Real-Time Insights, Benefits, Challenges, and Best Practices

    Zero ETL (Extract, Transform, Load) is a data processing technique that minimizes or eliminates traditional ETL workflows by enabling real-time data access and analysis. It offers benefits like reduced latency, lower complexity, increased flexibility, and scalability. However, it also presents challenges, including data governance and compliance risks, complex data integration, and potential vendor lock-in. Zero ETL is ideal for use cases such as real-time analytics in e-commerce, data-driven marketing campaigns, IoT and sensor data integration, and fraud detection in financial services. Best practices involve robust monitoring, implementing security measures, and maintaining clear communication about data changes.

  13.
    Article
    Data Engineer Things · 1y

    The Many Data Problem: Is Your Company Struggling with too much Data?

    Companies now face a 'Many Data problem' because data is easy to create and business decisions increasingly rely on it. The challenges include a lack of data interoperability, too many low-value dashboards, missing data governance, rising cloud data warehouse costs, and poor data quality. Improving interoperability, retiring unnecessary dashboards, implementing governance, optimizing costs, and raising data quality can help manage the problem.