Best of Daily Dose of Data Science | Avi Chawla | Substack · June 2024

  1. Article
    Daily Dose of Data Science | Avi Chawla | Substack · 2y

    20 Most Common Magic Methods

    Discover the 20 most common magic methods used in Python OOP, including __new__, __init__, and __str__. Learn how to use these methods and their importance in Python projects.
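
    As a rough illustration of the three methods named above, here is a minimal sketch. The `Point` class is hypothetical, and `__eq__` and `__add__` are included only as further examples of magic methods; they are not necessarily among the 20 the article covers.

```python
class Point:
    """Hypothetical 2D point used only to illustrate a few magic methods."""

    def __new__(cls, x, y):
        # __new__ creates the instance; it runs before __init__.
        return super().__new__(cls)

    def __init__(self, x, y):
        # __init__ initialises the instance that __new__ returned.
        self.x = x
        self.y = y

    def __str__(self):
        # __str__ backs str(obj) and print(obj).
        return f"Point({self.x}, {self.y})"

    def __eq__(self, other):
        # __eq__ backs the == operator.
        if not isinstance(other, Point):
            return NotImplemented
        return (self.x, self.y) == (other.x, other.y)

    def __add__(self, other):
        # __add__ backs the + operator.
        if not isinstance(other, Point):
            return NotImplemented
        return Point(self.x + other.x, self.y + other.y)
```

    With this class, `str(Point(1, 2) + Point(3, 4))` goes through `__add__`, then `__new__` and `__init__` for the result, and finally `__str__`.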

  2. Article

    4 Ways to Test ML Models in Production

    Testing ML models in production is crucial to ensure reliability and performance on real-world data. Four common strategies are A/B testing, canary testing, interleaved testing, and shadow testing. A/B testing distributes requests non-uniformly between models, while canary testing gradually rolls out the candidate model to a subset of users. Interleaved testing mixes predictions from both models, and shadow testing logs outputs without affecting user experience. These techniques help mitigate risks and validate the model effectively.
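
    The routing ideas in this summary can be sketched roughly as follows. Everything here is an assumption for illustration: the two model functions are hypothetical stand-ins, and the modulo bucketing is a simplification of the hashing a real system would likely use.

```python
import random


def legacy_model(x):
    # Hypothetical stand-in for the model currently serving users.
    return x * 2


def candidate_model(x):
    # Hypothetical stand-in for the new model under evaluation.
    return x * 2 + 1


def ab_route(rng, candidate_share):
    """A/B-style routing: each request goes to the candidate with a fixed probability."""
    return "candidate" if rng.random() < candidate_share else "legacy"


def canary_route(user_id, rollout_fraction):
    """Canary-style routing: a fixed subset of users sees the candidate."""
    # Deterministic bucket per user, so a given user always hits the same model.
    bucket = (user_id % 100) / 100
    return "candidate" if bucket < rollout_fraction else "legacy"


def serve_with_shadow(x, shadow_log):
    """Shadow-style serving: the user gets the legacy output; the candidate output is only logged."""
    served = legacy_model(x)
    shadow_log.append({"input": x, "legacy": served, "candidate": candidate_model(x)})
    return served
```

    Raising `rollout_fraction` over time mimics a gradual canary rollout, while the shadow log can be compared offline without any user ever seeing the candidate's output.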

  3. Article

    Poisson Regression vs. Linear Regression

    Linear regression may not be suitable for count data because it can produce negative predictions, which make no sense for quantities such as the number of calls received. Poisson regression, a type of generalized linear model (GLM), is better suited to count-based responses because it assumes the response follows a Poisson distribution. It ensures non-negative predictions and accounts for the fact that outcomes are not symmetrically distributed around the mean.
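
    A small numeric sketch of the contrast, assuming the usual log link for Poisson regression. The coefficients below are made up for illustration and are not fitted to any data.

```python
import math


def linear_prediction(x, intercept, slope):
    # Ordinary linear predictor: nothing stops it from going below zero.
    return intercept + slope * x


def poisson_mean(x, intercept, slope):
    # Log link: log(mu) = intercept + slope * x, so mu = exp(...) is always positive.
    return math.exp(intercept + slope * x)


def poisson_pmf(k, mu):
    # Probability of observing exactly k events when the mean is mu.
    return math.exp(-mu) * mu**k / math.factorial(k)
```

    With `intercept=-1.0` and `slope=0.5`, the linear predictor is negative at `x=0` while the Poisson mean stays positive. For a mean of 2, `poisson_pmf(1, 2.0)` is larger than `poisson_pmf(3, 2.0)`, so counts one below and one above the mean are not equally likely.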

  4. Article

    7 Categorical Data Encoding Techniques

    The post outlines seven techniques for encoding categorical data, including one-hot encoding, dummy encoding, effect encoding, label encoding, ordinal encoding, count encoding, and binary encoding. Each method is briefly explained along with the number of resulting features. The post also mentions the category-encoders library for more techniques and encourages reader interaction by asking for additional methods.
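
    To make the feature counts concrete, here is a dependency-free sketch of five of the listed encodings. These are simplified illustrations under my own assumptions (alphabetical category order, zero-based indices) and not the category-encoders implementations, which may differ in detail.

```python
from collections import Counter


def one_hot(values):
    # One column per category: k features for k categories.
    cats = sorted(set(values))
    return [[int(v == c) for c in cats] for v in values]


def dummy(values):
    # Like one-hot, but the first category is dropped: k - 1 features.
    cats = sorted(set(values))[1:]
    return [[int(v == c) for c in cats] for v in values]


def label(values):
    # A single integer column; the order here is alphabetical, not meaningful.
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    return [index[v] for v in values]


def count(values):
    # A single column holding how often each category occurs.
    freq = Counter(values)
    return [freq[v] for v in values]


def binary(values):
    # Category index written in binary: about log2(k) features.
    index = {c: i for i, c in enumerate(sorted(set(values)))}
    width = max(1, (len(index) - 1).bit_length())
    return [[(index[v] >> b) & 1 for b in reversed(range(width))] for v in values]
```

    For `["red", "green", "blue", "green"]`, one-hot yields three columns, dummy and binary yield two, and label and count yield one each.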

  5. Article

    Data Version Control

    Data version control is a critical skill for data scientists working on ML projects. It helps with versioning large datasets, which supports reproducibility and experiment traceability. Git is poorly suited to versioning large datasets because of file size limitations, and data version control tools address this problem.
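
    One common idea behind such tools is to keep the large file outside Git and commit only a small pointer file containing its content hash. The sketch below illustrates that idea only; the function names, the `.ptr.json` suffix, and the cache layout are hypothetical and do not reproduce any particular tool's format.

```python
import hashlib
import json
import shutil
from pathlib import Path


def track(data_path, cache_dir):
    """Copy the data file into a content-addressed cache and write a small pointer file."""
    data_path, cache_dir = Path(data_path), Path(cache_dir)
    # Reads the whole file at once for brevity; a real tool would hash in chunks.
    digest = hashlib.sha256(data_path.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copyfile(data_path, cache_dir / digest)
    # The pointer is tiny, so it is the part that would be committed to Git.
    pointer = data_path.with_name(data_path.name + ".ptr.json")
    pointer.write_text(json.dumps({"path": data_path.name, "sha256": digest}))
    return pointer


def checkout(pointer_path, cache_dir):
    """Restore the data file version that the pointer file refers to."""
    pointer_path = Path(pointer_path)
    meta = json.loads(pointer_path.read_text())
    target = pointer_path.with_name(meta["path"])
    shutil.copyfile(Path(cache_dir) / meta["sha256"], target)
    return target
```

    Checking out an older commit would bring back an older pointer file, and `checkout` would then restore the matching dataset version from the cache.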

  6. Article

    4 Strategies for Multi-GPU Training

    This post discusses four strategies for multi-GPU training: model parallelism, tensor parallelism, data parallelism, and pipeline parallelism.
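
    Of the four, data parallelism is the easiest to sketch without any GPU: each worker holds a full copy of the model, computes gradients on its own shard of the batch, and the gradients are then averaged. The toy below simulates this on a one-parameter model `y = w * x` with a mean squared error loss; the function names are hypothetical and the averaging step merely stands in for an all-reduce across devices.

```python
def grad_mse(w, xs, ys):
    # Gradient of mean((w * x - y) ** 2) with respect to w.
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / len(xs)


def data_parallel_grad(w, xs, ys, n_workers):
    """Simulate data parallelism: one gradient per shard, then an average."""
    if len(xs) % n_workers != 0:
        raise ValueError("this sketch assumes equal-sized shards")
    shard = len(xs) // n_workers
    local_grads = [
        # Every simulated worker uses the same replica of w on its own slice of data.
        grad_mse(w, xs[i * shard:(i + 1) * shard], ys[i * shard:(i + 1) * shard])
        for i in range(n_workers)
    ]
    # Averaging the local gradients plays the role of the all-reduce step.
    return sum(local_grads) / n_workers
```

    With equal-sized shards, the averaged gradient matches the gradient computed on the full batch, which is why every replica can apply the same update and stay in sync.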