Best of Daily Dose of Data Science | Avi Chawla | Substack — 2024

1
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
15 DS/ML Cheat Sheets
This post collates 15 cheat sheets covering essential data science and machine learning concepts. It includes resources on translating between different data manipulation libraries, multi-GPU training strategies, testing ML models in production, neural network optimization, and more. Detailed links are provided for further reading.
232
1
2
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
Building a 100% Local mini-ChatGPT
A guide on building a local mini-ChatGPT app using the Llama3.2-vision model and Chainlit. The post includes a demo, necessary tools, and step-by-step coding instructions with multimodal prompting. The code and further resources for AI engineering are provided on GitHub.
169
4
3
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
20 Most Common Magic Methods
Discover the 20 most common magic methods used in Python OOP, including __new__, __init__, and __str__. Learn how to use these methods and their importance in Python projects.
146
5
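As a small illustration of the three methods named in the summary above, the sketch below shows when each one runs. The `Point` class and its `created_by_new` attribute are hypothetical and are not taken from the post.

```python
class Point:
    """Hypothetical class used only to show when each magic method runs."""

    def __new__(cls, x, y):
        # __new__ creates the instance; it runs before __init__.
        instance = super().__new__(cls)
        instance.created_by_new = True
        return instance

    def __init__(self, x, y):
        # __init__ initializes the instance that __new__ returned.
        self.x = x
        self.y = y

    def __str__(self):
        # __str__ backs str(obj) and print(obj).
        return f"Point({self.x}, {self.y})"


p = Point(1, 2)
print(p)
```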
4
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
A Crash Course on Building RAG Systems – Part 4
Part 4 of the crash course on building RAG systems focuses on implementing RAG on multimodal data, specifically complex documents with tables, texts, and images. This series covers foundational components, evaluation methods, optimization techniques, and handling large data sets, making it highly beginner-friendly. Understanding how to build reliable RAG systems can reduce costs and enhance scalability for enterprises, bypassing the need for fine-tuning large language models (LLMs).
118
5
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
Pandas vs. FireDucks Performance Comparison
FireDucks is a highly optimized alternative to Pandas that gets a significant speed improvement from lazy execution. Users only need to replace their Pandas import with the FireDucks one. In the benchmarks covered in the post, FireDucks outperforms Pandas as well as other libraries such as Modin and Polars. The post provides instructions for installing FireDucks, using it in Jupyter Notebook, and integrating it into existing Python scripts.
98
2
6
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
RAG vs Agentic RAG
Agentic RAG systems introduce dynamic, adaptable behaviors into the traditional RAG workflow. Unlike traditional RAG, which retrieves and generates once, agentic RAGs iteratively refine queries and context, adapting based on the problem's complexity. This makes them more effective for complex queries and problem-solving. The open-source tool Opik by CometML supports the evaluation, testing, and monitoring of LLM applications from development to production, offering features like logging traces and detecting hallucinations.
86
7
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
A crash course on RAG systems—Part 5
Part 5 of the RAG crash course focuses on the implementation of key components for multimodal RAG systems, such as CLIP embeddings, multimodal prompting, and tool calling. The series aims to educate readers on building reliable RAG systems that can reduce costs and handle complex data types, ultimately aiding businesses in achieving greater impact.
83
8
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
25 Most Important Mathematical Definitions in Data Science
This post discusses the importance of mathematical knowledge in data science and machine learning, lists important mathematical formulations used in data science and statistics, and covers the use of mean squared error (MSE) in machine learning.
82
9
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
A Crash Course on Graph Neural Networks
Graph Neural Networks (GNNs) extend deep learning techniques to graph data, addressing the limitations of traditional models in capturing complex relationships. This piece covers the basics, benefits, tasks, data challenges, frameworks, and practical implementation of GNNs.
75
10
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
5 Chunking Strategies For RAG
Chunking is a critical step in designing a Retrieval-Augmented Generation (RAG) application as it enhances the efficiency and accuracy of the retrieval process. The post discusses five chunking strategies: fixed-size, semantic, recursive, document structure-based, and LLM-based chunking. Each method has its unique benefits and trade-offs, focusing on maintaining semantic integrity and computational efficiency. The choice of technique depends on document structure, model capabilities, and computational resources.
74
1
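Of the five strategies listed in the summary above, fixed-size chunking is the simplest to sketch. The version below is a minimal illustration that splits on characters with an optional overlap; the function name, the character-based splitting, and the default sizes are assumptions, not details from the post.

```python
def fixed_size_chunks(text, chunk_size=40, overlap=10):
    """Split text into character chunks of at most chunk_size.

    Consecutive chunks share `overlap` characters so that a sentence cut at a
    boundary still appears partly in both neighbours.
    """
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")

    step = chunk_size - overlap
    chunks = []
    start = 0
    while start < len(text):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the current chunk already reaches the end of the text
        start += step
    return chunks


print(fixed_size_chunks("abcdefghij", chunk_size=4, overlap=1))
```

Fixed-size splitting is cheap but can cut a sentence in half, which is the trade-off the semantic and structure-based strategies try to avoid.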
11
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
6 Elegant Jupyter Hacks
Discover 6 elegant Jupyter hacks to improve your experience. Learn how to retrieve a cell's output, enrich the default preview of a DataFrame, generate helpful hints as you write Pandas code, improve rendering of DataFrames, restart the Jupyter kernel without losing variables, and search code in all Jupyter Notebooks from the terminal.
65
3
12
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
A crash course on RAG systems—Part 6
Part 6 of the crash course on RAG systems explores how to build a more extensive and capable multimodal RAG system using CLIP embeddings, multimodal prompting, and tool calling. The post includes a unique dataset combining social media posts with images to provide a practical learning experience. The series covers everything from foundational components and evaluation to optimization and handling complex documents, aiming to help users implement reliable RAG systems and solve key NLP challenges with LLMs.
57
13
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
Simplify Python Imports with Explicit Packaging
Learn how to simplify your Python project imports by explicitly packaging your project with an __init__.py file. This method not only helps to avoid redundant imports but also allows you to specify which classes and functions can be imported from the package. The article explains the difference between modules, packages, and libraries, and provides a step-by-step guide on how to use __init__.py to streamline your code.
56
1
14
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
Building a RAG app using Llama-3.3
Meta released Llama-3.3, and this post provides a hands-on demo for building a RAG app using it. The app allows users to interact with a document via chat. It uses LlamaIndex for orchestration, Qdrant for a self-hosted vector database, and Ollama for serving Llama-3.3 locally. The implementation steps include loading and parsing a knowledge base, creating embeddings, indexing and storing them, defining a custom prompt template, and setting up a query engine.
48
15
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
4 Ways to Test ML Models in Production
Testing ML models in production is crucial to ensure reliability and performance on real-world data. Four common strategies are A/B testing, canary testing, interleaved testing, and shadow testing. A/B testing distributes requests non-uniformly between models, while canary testing gradually rolls out the candidate model to a subset of users. Interleaved testing mixes predictions from both models, and shadow testing logs outputs without affecting user experience. These techniques help mitigate risks and validate the model effectively.
46
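Two of the strategies in the summary above can be sketched as simple request routers. This is a minimal sketch under stated assumptions: `legacy_model` and `candidate_model` are hypothetical stand-ins for real models, and the 10% candidate share is an arbitrary choice for illustration.

```python
import random


def legacy_model(x):
    # Hypothetical stand-in for the model currently in production.
    return x * 2


def candidate_model(x):
    # Hypothetical stand-in for the new model under test.
    return x * 2 + 1


def ab_route(request, candidate_share=0.1, rng=random):
    """A/B testing: send a small, non-uniform share of traffic to the candidate."""
    if rng.random() < candidate_share:
        return "candidate", candidate_model(request)
    return "legacy", legacy_model(request)


shadow_log = []


def shadow_route(request):
    """Shadow testing: users always get the legacy output; the candidate's
    output is only logged for offline comparison."""
    response = legacy_model(request)
    shadow_log.append((request, response, candidate_model(request)))
    return response


rng = random.Random(0)
served_by = [ab_route(i, rng=rng)[0] for i in range(1000)]
print(served_by.count("candidate"), "of 1000 requests went to the candidate")
```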
16
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
A Simple Implementation of Boosting Algorithm
Boosting is a machine learning technique where each successive model attempts to correct the errors of its predecessor, leading to improved performance. Key design choices include tree construction, loss function, and weighting of each tree's contribution. A step-by-step example using the Sklearn decision tree regressor shows how boosting works and the incremental improvement in R² scores. Boosting algorithms are particularly significant for tabular data in machine learning.
44
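The idea in the summary above, where each model fits the errors left by its predecessors, can be sketched without external libraries. Note the swap: the post uses the Sklearn decision tree regressor, while this sketch uses a hand-written one-split "stump" on a single feature so that it runs on the standard library alone. The toy sine data, the learning rate, and the number of rounds are assumptions chosen for illustration.

```python
import math


def fit_stump(xs, residuals):
    """Fit a one-split regression stump to the residuals (squared error)."""
    best = None
    for threshold in sorted(set(xs))[:-1]:
        left = [r for x, r in zip(xs, residuals) if x <= threshold]
        right = [r for x, r in zip(xs, residuals) if x > threshold]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((r - left_mean) ** 2 for r in left)
               + sum((r - right_mean) ** 2 for r in right))
        if best is None or sse < best[0]:
            best = (sse, threshold, left_mean, right_mean)
    _, threshold, left_mean, right_mean = best
    return lambda x: left_mean if x <= threshold else right_mean


def r2_score(ys, preds):
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - p) ** 2 for y, p in zip(ys, preds))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot


def boost(xs, ys, rounds=20, learning_rate=0.5):
    """Each round fits a stump to the current residuals and adds a
    down-weighted copy of its output to the running prediction."""
    preds = [sum(ys) / len(ys)] * len(ys)  # start from the mean
    r2_history = []
    for _ in range(rounds):
        residuals = [y - p for y, p in zip(ys, preds)]
        stump = fit_stump(xs, residuals)
        preds = [p + learning_rate * stump(x) for x, p in zip(xs, preds)]
        r2_history.append(r2_score(ys, preds))
    return preds, r2_history


xs = [i / 10 for i in range(50)]
ys = [math.sin(x) for x in xs]
_, r2_history = boost(xs, ys)
print([round(score, 3) for score in r2_history[:5]])
```

The training R² in `r2_history` does not decrease from one round to the next, which mirrors the incremental improvement described in the summary.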
17
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
[Hands-on] Building a Llama-OCR app
Learn how to build a Llama-OCR app using the Llama-3.2-vision model and Ollama for local serving. The app converts uploaded images into structured markdown. The post provides a step-by-step guide on downloading necessary tools and prompting the model. Code for the full app is available on GitHub.
42
18
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
[Hands-on] Tool calling in LLMs
Tool calling allows language models to perform specific tasks by invoking external tools or APIs. The process involves recognizing when an external tool is needed, invoking the tool, and integrating its output into the model's response. This enhances the flexibility and capability of LLMs. A demo is provided to build a stock price retrieval assistant using the yfinance library.
41
1
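The three steps in the summary above (recognize that a tool is needed, invoke it, integrate its output) can be sketched with a stubbed model. Everything here is hypothetical: `fake_model` stands in for an LLM, the JSON shape of the tool call is an assumption, and `get_stock_price` returns made-up placeholder numbers rather than calling yfinance as the post does.

```python
import json


def get_stock_price(ticker):
    """Hypothetical tool with hard-coded placeholder data (not real prices)."""
    prices = {"AAPL": 100.0, "MSFT": 200.0}
    return prices[ticker]


# Registry mapping tool names to callables.
TOOLS = {"get_stock_price": get_stock_price}


def fake_model(user_message):
    """Stand-in for an LLM: emits a JSON tool call when it sees a known ticker."""
    for ticker in ("AAPL", "MSFT"):
        if ticker in user_message:
            return json.dumps(
                {"tool": "get_stock_price", "arguments": {"ticker": ticker}}
            )
    return json.dumps({"tool": None, "content": "I cannot help with that."})


def answer(user_message):
    # Step 1: the model decides whether an external tool is needed.
    decision = json.loads(fake_model(user_message))
    if decision["tool"] is None:
        return decision["content"]
    # Step 2: invoke the requested tool with the model-supplied arguments.
    result = TOOLS[decision["tool"]](**decision["arguments"])
    # Step 3: integrate the tool output into the final response.
    ticker = decision["arguments"]["ticker"]
    return f"{ticker} is trading at {result:.2f} (placeholder data)."


print(answer("What is the price of AAPL?"))
```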
19
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
CPython vs. Cython: How to Speed-up Native Python Programs
Learn how Cython optimizes Python's performance by converting Python code into C, resulting in significant speed improvements and reduced memory overheads. The post contrasts CPython's lack of built-in optimization with Cython's ability to restrict Python’s dynamicity through explicit data typing. The guide includes practical steps for implementing Cython in a Jupyter Notebook to achieve over 100x speedup in code execution.
40
2
20
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
9 Python Command Line Flags
Discover the 9 most common Python command line flags and how they modify the behavior of the Python interpreter. This includes flags like `-c` for running commands directly from the command line, `-i` for entering interactive mode after script execution, `-O` for removing assert statements, and `-OO` for additionally discarding docstrings. Additional flags like `-W` for controlling warnings (for example, ignoring them), `-m` for running modules as scripts, `-v` for verbose mode, `-x` for skipping the first line of a script, and `-E` for ignoring Python environment variables are also covered.
38
1
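Two of the flags in the summary above, `-c` and `-O`, are easy to observe from a script. The sketch below launches the current interpreter with `subprocess`; the `run_python` helper is a hypothetical convenience wrapper, not something from the post.

```python
import subprocess
import sys


def run_python(*args):
    """Run the current interpreter with the given flags and capture its output."""
    return subprocess.run(
        [sys.executable, *args], capture_output=True, text=True
    )


snippet = "assert False, 'boom'; print('asserts skipped')"

# -c runs the code passed on the command line.
simple = run_python("-c", "print(1 + 1)")

# Without -O the assert fires and the process exits with an error.
plain = run_python("-c", snippet)

# With -O assert statements are stripped, so the print is reached.
optimized = run_python("-O", "-c", snippet)

print(simple.stdout.strip())
print(plain.returncode != 0)
print(optimized.stdout.strip())
```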
21
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
Building a Multi-agent Financial Analyst
The post demonstrates building a multi-agent financial analyst using Microsoft's Autogen and Llama3-70B. It outlines the tech stack, including the roles of code executor and code writer agents. The guide provides steps to set up the agents, execute code, and display stock analysis results. Additional resources and a GitHub repository for further exploration are also mentioned.
36
22
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
5 LLM Fine-tuning Techniques Explained Visually
This post explains five fine-tuning techniques for LLMs: LoRA, LoRA-FA, VeRA, Delta-LoRA, and LoRA+.
36
23
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
Our Agentic Workflow to Write and Publish Social Content
A personal multi-agent app was developed to automate the creation and publication of social media content. The tech stack includes CrewAI for building workflows, FireCrawl for web scraping, and Typefully for post scheduling. The app processes content from a blog or newsletter, understands the writing style, and drafts posts for LinkedIn and X, publishing them via Typefully's API. Detailed insights and code are accessible in CrewAI's documentation.
35
1
24
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
16 Popular Open-source Contributions by Big Tech
Big tech companies like Microsoft, Google, Meta, Yandex, and NVIDIA have significantly contributed to the machine learning ecosystem through various open-source projects. These contributions include Microsoft's DeepSpeed and ONNX, Google's TensorFlow and JAX, Meta's PyTorch and LLaMA, Yandex's CatBoost and ClickHouse, and NVIDIA's RAPIDS and TensorRT. Understanding these tools can help you tackle real-world problems efficiently.
35
1
25
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
GROUPING SETS in SQL
Learn how to efficiently run multiple aggregations in SQL using GROUPING SETS, which allows scanning the table just once. This method is more efficient compared to using UNION with separate queries. The post provides a detailed example and a link to a Jupyter Notebook for practical implementation.
35
