A practical introduction to dataframe-oriented programming, arguing that dataframes should be used more broadly, beyond data teams. Covers the core idea of keeping data inside the dataframe and avoiding loops, with code examples in pandas, PySpark, SQL, and R. Explains key preprocessing techniques such as melting wide data into long format, flattening nested JSON, and handling nulls. Also covers performance considerations, including avoiding Python loops (shown to be roughly 500x slower than native pandas operations), data representation optimizations (booleans, integers, categoricals), and the tradeoffs between lazy and strict evaluation and between single-node and distributed systems. Ends with a light promotion of csvbase, the author's open-source data sharing tool.

20m read time, from csvbase.com
Table of contents
Worked examples of "dataframe-oriented programming"
Hint 1: Pre-processing matters
Hint 2: Dataframe libraries differ, a bit
Hint 3: Speed does matter
Hint 4: Use csvbase.com
The future of the past/past of the future
See also
