Best of Daily Dose of Data Science | Avi Chawla | Substack — October 2024

1
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
5 Chunking Strategies For RAG
Chunking is a critical step in designing a Retrieval-Augmented Generation (RAG) application because it improves the efficiency and accuracy of retrieval. The post discusses five chunking strategies: fixed-size, semantic, recursive, document structure-based, and LLM-based chunking. Each method has its own benefits and trade-offs between preserving semantic integrity and keeping computation cheap. The choice of technique depends on document structure, model capabilities, and computational resources.
74
1
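As a rough illustration of the simplest strategy in the chunking post above, here is a minimal sketch of fixed-size chunking. It is not taken from the article: the function name `fixed_size_chunks`, the character-based sizing, and the `overlap` parameter are all assumptions made for this example (real pipelines often count tokens rather than characters).

```python
def fixed_size_chunks(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into chunks of at most `chunk_size` characters.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears whole in at least one chunk (an assumed design choice).
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must satisfy 0 <= overlap < chunk_size")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current chunk reaches the end of the text,
        # so no trailing chunk is fully contained in the previous one.
        if start + chunk_size >= len(text):
            break
        start += step
    return chunks


example_chunks = fixed_size_chunks("abcdefghij", chunk_size=4, overlap=2)
```

With these toy settings every chunk is four characters long and shares two characters with its neighbour. The other strategies in the list trade this simplicity for boundaries that follow meaning or document structure.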
2
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
What's Missing from Python OOP Encapsulation
Python does not strictly enforce encapsulation the way languages like C++ do. Public, protected, and private members are all accessible from outside the class: protected members (single leading underscore) behave like public ones, and private members (double leading underscore) remain reachable through their name-mangled form. Encapsulation in Python therefore relies on conventions rather than strict rules, placing the responsibility on programmers to follow them.
30
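The encapsulation behaviour summarized above can be seen in a few lines. This is a small sketch with a hypothetical `Account` class (the class and attribute names are invented for illustration and are not from the article).

```python
class Account:
    def __init__(self, owner: str, balance: int) -> None:
        self.owner = owner          # public
        self._branch = "main"       # "protected" by convention only
        self.__balance = balance    # "private": stored as _Account__balance


acct = Account("Ada", 100)

public_value = acct.owner           # works, as expected
protected_value = acct._branch      # also works; the underscore is only a hint

# Outside the class body the name is not mangled, so the plain
# double-underscore lookup fails with AttributeError.
try:
    acct.__balance
except AttributeError:
    direct_access_failed = True
else:
    direct_access_failed = False

# The mangled name is still reachable, so nothing is truly hidden.
mangled_value = acct._Account__balance
```

Name mangling mainly guards against accidental name clashes in subclasses; it does not act as an access barrier.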
3
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
6 Graph Feature Engineering Techniques
Discover essential techniques for graph feature engineering, crucial for building effective graph neural networks (GNNs). Learn how to create a dummy social networking graph dataset and derive key features like node degree and centrality measures using NetworkX. The post highlights the significance of these features in enhancing model performance and provides real-world examples of graph machine learning applications by tech giants. Gain insights into various GNN tasks, data challenges, frameworks, and advanced architectures.
29
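To make the node-level features mentioned in the graph post concrete, here is a dependency-free sketch of node degree and degree centrality on a toy social graph. The post itself uses NetworkX, which exposes comparable quantities through `G.degree` and `nx.degree_centrality`; the plain-Python helpers and the edge list below are assumptions made so the example runs without extra packages.

```python
from collections import defaultdict

# Hypothetical undirected "friendship" edges for a tiny social graph.
edges = [("alice", "bob"), ("alice", "carol"), ("bob", "carol"), ("carol", "dave")]


def build_adjacency(edge_list):
    """Map each node to the set of its neighbours (undirected graph)."""
    adjacency = defaultdict(set)
    for u, v in edge_list:
        adjacency[u].add(v)
        adjacency[v].add(u)
    return dict(adjacency)


def node_degree(adjacency):
    """Degree = number of neighbours of each node."""
    return {node: len(neighbours) for node, neighbours in adjacency.items()}


def degree_centrality(adjacency):
    """Degree divided by (n - 1), the maximum possible degree in a simple graph."""
    n = len(adjacency)
    if n <= 1:
        return {node: 0.0 for node in adjacency}
    return {node: len(neighbours) / (n - 1) for node, neighbours in adjacency.items()}


adjacency = build_adjacency(edges)
degrees = node_degree(adjacency)
centrality = degree_centrality(adjacency)
```

In this toy graph `carol` is connected to every other node, so her degree centrality is the maximum value of 1.0. Features like these can then be attached to nodes as model inputs.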
4
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
Identify Fuzzy Duplicates in a Million Records
Data duplication is a significant issue for many organizations, but traditional methods like Pandas' `df.drop_duplicates()` only handle exact duplicates. For fuzzy duplicates, which are not exact copies but appear similar, a naive approach of pairwise comparison is computationally infeasible at large scales. By leveraging the property of lexical overlap and applying bucketing techniques, unnecessary comparisons can be drastically reduced, optimizing the deduplication process. This approach can yield accurate results in hours rather than years, making it highly efficient for large datasets.
15
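The bucketing idea in the fuzzy-duplicates summary can be sketched as follows. This is an illustrative approximation, not the article's implementation: bucketing on the first three characters, scoring with `difflib.SequenceMatcher`, the 0.85 threshold, and the sample records are all assumptions.

```python
from collections import defaultdict
from difflib import SequenceMatcher
from itertools import combinations


def bucket_key(record: str, prefix_len: int = 3) -> str:
    """Assumed bucketing rule: records sharing a lowercase prefix share a bucket."""
    return record.lower().strip()[:prefix_len]


def find_fuzzy_duplicates(records, threshold=0.85, prefix_len=3):
    """Return index pairs whose similarity is at least `threshold`.

    Only records in the same bucket are compared, which avoids the full
    pairwise comparison over the whole dataset.
    """
    buckets = defaultdict(list)
    for idx, record in enumerate(records):
        buckets[bucket_key(record, prefix_len)].append(idx)

    pairs = []
    for indices in buckets.values():
        for i, j in combinations(indices, 2):
            score = SequenceMatcher(None, records[i].lower(), records[j].lower()).ratio()
            if score >= threshold:
                pairs.append((i, j))
    return sorted(pairs)


records = [
    "Daniel Lopez, 719 Greene St",
    "Daniel Lopes, 719 Greene St",
    "Maria Silva, 12 Oak Ave",
    "Daniel Lopez, 719 Greene St.",
]
duplicate_pairs = find_fuzzy_duplicates(records)
```

A prefix-based key misses duplicates whose first characters differ, so a practical version would likely use several keys per record; the point here is only that comparisons stay inside buckets.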
5
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
Clean ML Datasets With Cleanlab
Cleanlab, an open-source library developed by MIT researchers, helps clean datasets in just four lines of Python code. By identifying issues such as out-of-distribution samples, outliers, label problems, and duplicates, Cleanlab significantly improves dataset quality, which is crucial for training accurate machine learning models. Several demo notebooks are available for further learning.
12
