Best of Daily Dose of Data Science | Avi Chawla | Substack — June 2024
- 1
- 2
Daily Dose of Data Science | Avi Chawla | Substack·2y
4 Ways to Test ML Models in Production
Testing ML models in production is crucial for confirming that they are reliable and perform well on real-world data. Four common strategies are A/B testing, canary testing, interleaved testing, and shadow testing. A/B testing splits incoming requests non-uniformly between the existing model and the candidate model. Canary testing releases the candidate to a small subset of users and widens the rollout gradually. Interleaved testing mixes predictions from both models, for example in a single list of recommendations. Shadow testing also sends requests to the candidate but only logs its outputs, so the user experience is unaffected. All four limit the risk of a bad deployment while validating the candidate on live traffic.
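The four strategies differ mainly in how a request is routed and whose output the user sees. The sketch below illustrates that difference in plain Python. It is not taken from the post: the function names, the 10% candidate share, and the toy lambda "models" are all assumptions made for illustration.

```python
import random

def route_ab(candidate_share, rng):
    """A/B test: send a fixed (usually small) share of requests to the candidate."""
    return "candidate" if rng.random() < candidate_share else "legacy"

def route_canary(user_id, canary_users):
    """Canary test: only users in the current rollout subset see the candidate."""
    return "candidate" if user_id in canary_users else "legacy"

def interleave(legacy_items, candidate_items):
    """Interleaved test: alternate items from both ranked lists, skipping duplicates.

    zip() stops at the shorter list, which is enough for this sketch.
    """
    mixed, seen = [], set()
    for pair in zip(legacy_items, candidate_items):
        for item in pair:
            if item not in seen:
                seen.add(item)
                mixed.append(item)
    return mixed

def shadow_predict(x, legacy_model, candidate_model, shadow_log):
    """Shadow test: both models score the request, only the legacy output is served."""
    served = legacy_model(x)
    shadow_log.append((x, served, candidate_model(x)))  # logged, never shown to the user
    return served

# Hypothetical usage with toy stand-ins for the two models.
rng = random.Random(0)
routes = [route_ab(0.1, rng) for _ in range(10_000)]
candidate_fraction = routes.count("candidate") / len(routes)

mixed = interleave(["a", "b", "c"], ["b", "d", "e"])

shadow_log = []
served = shadow_predict(3, lambda x: x * 2, lambda x: x * 2 + 1, shadow_log)
```

In a real system the routing decision would usually sit in a load balancer or serving layer rather than in application code, and the canary subset would grow over time as confidence in the candidate increases.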
- 3
Poisson Regression vs. Linear Regression
Linear regression is often a poor fit for count data because it can produce negative predictions, which make no sense for quantities such as the number of calls received. Poisson regression, a type of generalized linear model (GLM), is better suited to count-based responses: it assumes the response follows a Poisson distribution, guarantees non-negative predictions, and accounts for the asymmetry of counts, where outcomes on either side of the mean are not equally likely.
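The contrast can be seen on a tiny example. The sketch below fits both models to made-up count data using only the standard library. The data, the variable names, and the choice of Newton-Raphson for the Poisson fit are assumptions for illustration; in practice a GLM library would normally do the fitting.

```python
import math

# Hypothetical toy data: x is a time index, y is the number of calls received.
xs = [0, 1, 2, 3, 4, 5]
ys = [0, 1, 1, 3, 6, 12]
n = len(xs)

# Linear regression y = a + b * x, fitted by closed-form least squares.
x_mean, y_mean = sum(xs) / n, sum(ys) / n
b = (sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
     / sum((x - x_mean) ** 2 for x in xs))
a = y_mean - b * x_mean
linear_pred = [a + b * x for x in xs]

# Poisson regression with a log link: mu = exp(b0 + b1 * x).
# Fitted by Newton-Raphson on the Poisson log-likelihood.
b0, b1 = math.log(y_mean), 0.0
for _ in range(50):
    mus = [math.exp(b0 + b1 * x) for x in xs]
    # Gradient of the log-likelihood with respect to (b0, b1).
    g0 = sum(y - m for y, m in zip(ys, mus))
    g1 = sum((y - m) * x for x, y, m in zip(xs, ys, mus))
    # Entries of the 2x2 Fisher information matrix.
    h00 = sum(mus)
    h01 = sum(m * x for x, m in zip(xs, mus))
    h11 = sum(m * x * x for x, m in zip(xs, mus))
    det = h00 * h11 - h01 * h01
    # Newton step: solve the 2x2 system explicitly.
    b0 += (h11 * g0 - h01 * g1) / det
    b1 += (h00 * g1 - h01 * g0) / det
poisson_pred = [math.exp(b0 + b1 * x) for x in xs]

print("linear :", [round(p, 2) for p in linear_pred])
print("poisson:", [round(p, 2) for p in poisson_pred])
```

On this data the straight line dips below zero at x = 0, while the Poisson predictions stay positive by construction because they pass through the exponential.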
- 4
7 Categorical Data Encoding Techniques
The post outlines seven techniques for encoding categorical data: one-hot, dummy, effect, label, ordinal, count, and binary encoding. Each is explained briefly, along with the number of features it produces. The post also points to the category-encoders library for further techniques and invites readers to suggest additional methods.
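To make the seven techniques concrete, here is a minimal sketch of each in plain Python. It does not use the category-encoders library, and the sample values, the alphabetical label order, and the choice of the first category as the reference level are assumptions for illustration.

```python
from collections import Counter

colors = ["red", "green", "blue", "green", "red", "green"]
categories = sorted(set(colors))                 # ['blue', 'green', 'red']
index = {c: i for i, c in enumerate(categories)}
k = len(categories)

# Label encoding: one integer column; the order carries no meaning.
label = [index[c] for c in colors]

# One-hot encoding: one binary column per category (k columns).
one_hot = [[int(index[c] == j) for j in range(k)] for c in colors]

# Dummy encoding: one-hot with the first category dropped as reference (k - 1 columns).
dummy = [row[1:] for row in one_hot]

# Effect encoding: like dummy, but the reference category becomes a row of -1s.
effect = [[-1] * len(row) if not any(row) else row for row in dummy]

# Count encoding: replace each category with how often it occurs (1 column).
counts = Counter(colors)
count_enc = [counts[c] for c in colors]

# Binary encoding: write the integer label in base 2, one column per bit.
n_bits = max(1, (k - 1).bit_length())
binary = [[(index[c] >> bit) & 1 for bit in reversed(range(n_bits))] for c in colors]

# Ordinal encoding: one integer column whose order is meaningful and chosen by hand.
sizes = ["small", "large", "medium"]
size_order = {"small": 0, "medium": 1, "large": 2}
ordinal = [size_order[s] for s in sizes]
```

With three categories, the sketch yields three one-hot columns, two dummy or effect columns, two binary columns, and a single column for label, count, and ordinal encoding.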
- 5
Data Version Control
Data version control is a critical skill for data scientists working on ML projects. It makes it possible to version large datasets, which supports reproducibility and experiment traceability. Git alone is not suitable for versioning datasets because of file size limitations; data version control is designed to solve this problem.
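One common way around the file size problem is to keep the large file outside Git and commit only a small pointer that records its content hash. The sketch below shows that idea in plain Python. It is a simplified illustration, not the implementation of any particular tool: the `snapshot` and `restore` helpers, the `.pointer.json` suffix, and the cache layout are all hypothetical.

```python
import hashlib
import json
import shutil
import tempfile
from pathlib import Path

def snapshot(data_file: Path, cache_dir: Path) -> Path:
    """Copy the dataset into a content-addressed cache and write a small pointer file."""
    digest = hashlib.sha256(data_file.read_bytes()).hexdigest()
    cache_dir.mkdir(parents=True, exist_ok=True)
    shutil.copy2(data_file, cache_dir / digest)
    # The pointer is tiny, so it is the part that would be committed to Git.
    pointer = data_file.with_name(data_file.name + ".pointer.json")
    pointer.write_text(json.dumps({"path": data_file.name, "sha256": digest}))
    return pointer

def restore(pointer: Path, cache_dir: Path) -> None:
    """Rebuild the dataset version that the pointer file refers to."""
    meta = json.loads(pointer.read_text())
    shutil.copy2(cache_dir / meta["sha256"], pointer.with_name(meta["path"]))

# Hypothetical usage inside a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    data, cache = root / "train.csv", root / "cache"

    data.write_text("x,y\n1,2\n")
    pointer = snapshot(data, cache)
    v1_pointer_text = pointer.read_text()      # what Git would store for version 1

    data.write_text("x,y\n1,2\n3,4\n")
    snapshot(data, cache)                      # version 2 replaces the pointer

    pointer.write_text(v1_pointer_text)        # emulate checking out the old pointer
    restore(pointer, cache)
    restored = data.read_text()
    cached_versions = len(list(cache.iterdir()))
```

Because the pointer changes whenever the data changes, each commit in Git identifies exactly one dataset version, which is what makes experiments reproducible.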
- 6