Best of Daily Dose of Data Science | Avi Chawla | Substack · October 2024

  1. Article · Daily Dose of Data Science | Avi Chawla | Substack · 1y

    5 Chunking Strategies For RAG

    Chunking is a critical step in designing a Retrieval-Augmented Generation (RAG) application because chunk quality directly affects the efficiency and accuracy of retrieval. The post discusses five chunking strategies: fixed-size, semantic, recursive, document structure-based, and LLM-based chunking. Each has its own benefits and trade-offs between preserving semantic integrity and keeping computation cheap. The right choice depends on document structure, model capabilities, and available computational resources.
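
    As a rough illustration of the simplest of these strategies, fixed-size chunking, here is a minimal sketch. The function name `fixed_size_chunks`, the character-based sizing, and the `overlap` parameter are assumptions made for this example and are not taken from the post.

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 20) -> list[str]:
    """Split text into chunks of at most chunk_size characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears intact in at least part of a neighbouring chunk.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once this chunk reaches the end, so no trailing chunk is
        # made up purely of already-covered overlap.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


example_chunks = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1)
```

    Splitting on characters ignores sentence and section boundaries, which is the trade-off the other four strategies try to address.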

  2. Article · Daily Dose of Data Science | Avi Chawla | Substack · 2y

    What's Missing from Python OOP Encapsulation

    Unlike languages such as C++, Python does not strictly enforce encapsulation. Public, protected, and private members are all accessible from outside the class: protected members (single leading underscore) behave like public ones, and private members (double leading underscore) remain reachable through their name-mangled form. Encapsulation in Python relies on conventions rather than enforced rules, so the responsibility for respecting them falls on the programmer.
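
    The behaviour described above can be seen in a short sketch. The `Account` class and its attribute names are hypothetical, chosen only for illustration.

```python
class Account:
    def __init__(self, owner, balance):
        self.owner = owner        # public
        self._balance = balance   # "protected" by convention only
        self.__pin = "1234"       # "private": stored as _Account__pin


acct = Account("Ada", 100)

public_value = acct.owner
protected_value = acct._balance      # works: nothing enforces the convention
mangled_value = acct._Account__pin   # works: the mangled name is reachable

# The unmangled name does not exist on the instance, so direct access fails.
try:
    acct.__pin
    direct_access_failed = False
except AttributeError:
    direct_access_failed = True
```

    Name mangling mainly prevents accidental clashes in subclasses; it is not an access-control mechanism.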

  3. Article · Daily Dose of Data Science | Avi Chawla | Substack · 1y

    6 Graph Feature Engineering Techniques

    Discover essential techniques for graph feature engineering, crucial for building effective graph neural networks (GNNs). Learn how to create a dummy social networking graph dataset and derive key features like node degree and centrality measures using NetworkX. The post highlights the significance of these features in enhancing model performance and provides real-world examples of graph machine learning applications by tech giants. Gain insights into various GNN tasks, data challenges, frameworks, and advanced architectures.
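
    To make two of the named features concrete, here is a dependency-free sketch that computes node degree and degree centrality for a tiny undirected graph. The edge list and the helper `node_features` are made up for this example. The post itself uses NetworkX, whose `nx.degree_centrality` applies the same normalization, degree divided by (n - 1).

```python
def node_features(edges):
    """Return degree and degree centrality for each node of an undirected graph."""
    adjacency = {}
    for u, v in edges:
        adjacency.setdefault(u, set()).add(v)
        adjacency.setdefault(v, set()).add(u)

    n = len(adjacency)
    features = {}
    for node, neighbours in adjacency.items():
        degree = len(neighbours)
        # Degree centrality: fraction of the other nodes this node touches.
        centrality = degree / (n - 1) if n > 1 else 0.0
        features[node] = {"degree": degree, "degree_centrality": centrality}
    return features


# Hypothetical friendship graph.
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"), ("carol", "dave")]
features = node_features(edges)
```

    Features like these can then be attached to each node as inputs to a downstream model.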

  4. Article · Daily Dose of Data Science | Avi Chawla | Substack · 1y

    Identify Fuzzy Duplicates in a Million Records

    Data duplication is a significant issue for many organizations, but traditional methods like Pandas' `df.drop_duplicates()` only handle exact duplicates. For fuzzy duplicates, which are not exact copies but appear similar, a naive pairwise comparison is computationally infeasible at large scales. Because fuzzy duplicates tend to overlap lexically, records can be grouped into buckets so that only records sharing a bucket are compared, which drastically reduces the number of comparisons. This approach can yield accurate results in hours rather than years, making deduplication practical for large datasets.
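
    A minimal sketch of the bucketing idea follows. It assumes that true fuzzy duplicates share at least one token prefix, and it uses the standard library's `difflib.SequenceMatcher` as the similarity measure. The prefix length, the threshold, the sample records, and the helper names are illustrative choices, not details taken from the post.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def bucket_keys(record, prefix_len=3):
    """Bucket keys for a record: the first few characters of each token."""
    return {token[:prefix_len] for token in record.lower().split()}


def fuzzy_duplicate_pairs(records, threshold=0.8, prefix_len=3):
    """Return index pairs (i, j), i < j, whose similarity meets the threshold."""
    buckets = defaultdict(list)
    for idx, record in enumerate(records):
        for key in bucket_keys(record, prefix_len):
            buckets[key].append(idx)

    # Only records that share at least one bucket become candidate pairs.
    candidates = set()
    for members in buckets.values():
        candidates.update(combinations(members, 2))

    return sorted(
        (i, j)
        for i, j in candidates
        if SequenceMatcher(None, records[i].lower(), records[j].lower()).ratio() >= threshold
    )


records = [
    "Daniel Lopez, 719 Greene St, East Rhonda",
    "Danielle Lopez, 719 Green St, East Rhonda",
    "Maria Chen, 12 Ocean Ave, Portview",
]
pairs = fuzzy_duplicate_pairs(records)
```

    The third record shares no bucket with the other two, so it is never compared against them. That skipped work is where the savings come from at scale.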

  5. Article · Daily Dose of Data Science | Avi Chawla | Substack · 1y

    Clean ML Datasets With Cleanlab

    Cleanlab, an open-source library developed by MIT researchers, helps clean datasets in just four lines of Python code. By identifying issues such as out-of-distribution samples, outliers, label problems, and duplicates, Cleanlab significantly improves dataset quality, which is crucial for training accurate machine learning models. Several demo notebooks are available for further learning.