Best of Big Data, June 2024

  1.
    Article
    Quastor Daily · 2y

    The Architecture of Grab's Data Lake

    This post discusses the architecture of Grab's Data Lake, including the design choices for data storage formats, the use of Merge on Read and Copy on Write strategies, and the importance of efficient data storage for data analysis and insights.

  2.
    Article
    Substack · 2y

    A Primer on Data Warehouses

    This post provides an overview of data warehouses, including their history, technology, and market trends. It explains why investors are interested in data and highlights the consolidation happening in the data space. The post also discusses the rise of SQL, the transition to cloud-based data warehouses, and the components of a data warehouse. It concludes with an analysis of the data warehousing market, including major players and trends.

  3.
    Article
    NVIDIA Developer · 2y

    Machine Learning – What Is It and Why Does It Matter?

    Many industries use data science and machine learning to recognize patterns, detect changes, and make predictions that improve their operations. The availability of open-source tools since the mid-2000s has accelerated this trend. Today, improvements in predictive models can translate into large financial gains. However, training these models requires substantial computational resources, and GPUs offer a way to keep scaling where CPUs no longer can as Moore's law slows.

  4.
    Article
    Substack · 2y

    The Future of Vector Search

    The article discusses the landscape of vector search and databases, the importance of choosing the right vector search system based on specific needs, and the features that differentiate these systems. It emphasizes the importance of deployment and scalability, performance and efficiency, and data reliability and security. The article also highlights the need for data quality and transparency in AI, drawing lessons from the inconsistencies in financial factors data.

  5.
    Article
    Towards Data Science · 2y

    Data Engineering, Redefined

    The post argues for redefining data engineering by separating it from the implementation of business logic, which should remain the domain of application developers. It describes how current practices produce brittle, uncoordinated data pipelines, and proposes that data engineering focus on the movement, manipulation, and management of data in a technical sense. The post calls for a clearer separation between business logic and data manipulation to improve software quality and maintainability.

  6.
    Article
    Collections · 2y

    Getting Started with PySpark: Efficient Data Processing for Beginners and Speeding up Machine Learning Projects

    PySpark, the Python API for Apache Spark, facilitates efficient big data processing and machine learning by distributing tasks across multiple machines. It’s easy for Python users to learn and scales well from a single machine to large clusters. This overview covers installation, basic usage, and creating custom functions to enhance machine learning projects with streamlined data preparation tasks like quality checks and finding duplicates.