This post discusses 10 best practices for data science, including starting and staying organized, using version control, separating notebooks and source files, writing tests and sanity checks, automating the data pipeline, centralizing important parameters, making project runs verbose, and starting with a simple end-to-end pipeline. These practices promote reproducibility, collaboration, reliability, and efficiency in data science projects.

17m read time
From medium.com
Table of contents
Rule 1: Start Organized, Stay Organized
Rule 2: Everything Comes from Somewhere, and the Raw Data is Immutable
Rule 3: Version Control is Basic Professionalism
Rule 4: Notebooks are for Exploration, Source Files are for Repetition
Rule 5: Tests and Sanity Checks Prevent Catastrophes
Rule 6: Fail Loudly, Fail Quickly
Rule 7: Project Runs are Fully Automated from Raw Data to Final Outputs
Rule 8: Important Parameters are Extracted and Centralized
Rule 9: Project Runs are Verbose by Default and Result in Tangible Artifacts
Rule 10: Start with the Simplest Possible End-to-End Pipeline