10 Best Practices for Data Science

This post presents 10 best practices for data science projects: staying organized from the start, treating raw data as immutable, using version control, keeping notebooks for exploration and source files for repeatable code, writing tests and sanity checks, failing loudly and quickly, automating the pipeline from raw data to final outputs, centralizing important parameters, making project runs verbose, and starting with the simplest possible end-to-end pipeline. Together, these practices promote reproducibility, collaboration, reliability, and efficiency.

#data-science

#automation

Jun 17, 2024•17m read time•From medium.com

Table of contents

Rule 1: Start Organized, Stay Organized
Rule 2: Everything Comes from Somewhere, and the Raw Data is Immutable
Rule 3: Version Control is Basic Professionalism
Rule 4: Notebooks are for Exploration, Source Files are for Repetition
Rule 5: Tests and Sanity Checks Prevent Catastrophes
Rule 6: Fail Loudly, Fail Quickly
Rule 7: Project Runs are Fully Automated from Raw Data to Final Outputs
Rule 8: Important Parameters are Extracted and Centralized
Rule 9: Project Runs are Verbose by Default and Result in Tangible Artifacts
Rule 10: Start with the Simplest Possible End-to-End Pipeline