Using dataframes to write smaller, faster programs

Cal Paterson

A practical introduction to dataframe-oriented programming, arguing that dataframes deserve wider use outside data teams. It covers the core idea of keeping data inside the dataframe and avoiding loops, with code examples in pandas, PySpark, SQL, and R. It explains key preprocessing techniques: melting wide data into long format, flattening nested JSON, and handling nulls. It also covers performance considerations, including avoiding Python loops (shown to be roughly 500x slower than native pandas operations), data representation optimizations (booleans, integers, categoricals), and the tradeoffs between lazy and strict evaluation and between single-node and distributed systems. It ends with a light promotion of csvbase, the author's open-source data sharing tool.

Take the tools out of 'Data', but don't take the data out of the tools

Worked examples of "dataframe-oriented programming"
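
As a rough illustration of the style (not one of the article's own examples), here is a minimal pandas sketch that melts a wide table into long format and then applies a calculation to a whole column instead of looping. The table, the column names and the 1.2 multiplier are all made up for the purpose.

```python
import pandas as pd

# Hypothetical wide-format table: one row per city, one column per year.
wide = pd.DataFrame({
    "city": ["Leeds", "York"],
    "2021": [10.0, 4.0],
    "2022": [12.0, 5.0],
})

# Melt wide-to-long: one row per (city, year) observation.
tidy = wide.melt(id_vars="city", var_name="year", value_name="sales")
tidy["year"] = tidy["year"].astype(int)

# Loop version: works, but pulls every value out into a Python object.
looped = [row.sales * 1.2 for row in tidy.itertuples()]

# Dataframe version: one vectorised expression, data stays in the dataframe.
tidy["sales_with_tax"] = tidy["sales"] * 1.2
```

Both versions compute the same numbers; the difference is that the second never leaves the dataframe, which is the habit the rest of the piece argues for.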

Hint 2: Dataframe libraries differ, a bit

The future of the past/past of the future