Best of Data Processing — August 2024

1
Article
Towards AI·2y
The Best Practices of RAG
Explores the process of retrieval-augmented generation (RAG) and outlines best practices for its various components. Discusses query classification, efficient document retrieval, re-ranking for relevance, re-packing into structured formats, and summarization to extract key information. The post also provides a comprehensive evaluation of these practices and concludes with insights and recommendations.
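As a rough illustration of how the stages named above fit together, here is a toy sketch in plain Python. Every function name, the sample documents, and the rules inside each stage are hypothetical stand-ins (keyword overlap instead of a real retriever or re-ranker, truncation instead of a real summarizer, and an assumed choice to pack the top-ranked passage last); none of it is taken from the article itself.

```python
import re

# Hypothetical mini-corpus, used only for this illustration.
DOCS = [
    "DuckDB is an in-process analytical database.",
    "Polars is a DataFrame library written in Rust.",
    "Spark evaluates transformations lazily until an action runs.",
]


def tokens(text):
    """Lowercased alphanumeric tokens of a string, as a set."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def needs_retrieval(query):
    # Stand-in query classifier (assumption): only questions go to retrieval.
    return query.strip().endswith("?")


def retrieve(query, docs, k=3):
    # Stand-in retriever: keep up to k docs sharing at least one token with the query.
    q = tokens(query)
    return [d for d in docs if q & tokens(d)][:k]


def rerank(query, docs):
    # Stand-in re-ranker: order by token overlap with the query, highest first.
    q = tokens(query)
    return sorted(docs, key=lambda d: len(q & tokens(d)), reverse=True)


def repack(docs):
    # Assumed packing order: reverse the ranking so the best passage comes last.
    return list(reversed(docs))


def summarize(docs, max_chars=80):
    # Stand-in summarizer: truncate each passage to max_chars characters.
    return [d[:max_chars] for d in docs]


def build_prompt(query):
    """Run the stages in order and return the text that would go to a generator."""
    if not needs_retrieval(query):
        return query
    context = summarize(repack(rerank(query, retrieve(query, DOCS))))
    return "\n".join(context + [f"Question: {query}"])


prompt = build_prompt("What is Polars written in?")
print(prompt)
```

In a real system each stand-in would be replaced by a model or index; the point of the sketch is only the order in which the stages hand data to one another.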
2
Article
Lobsters·2y
CSVs Are Kinda Bad. DSVs Are Kinda Good.
CSVs often pose challenges with different delimiters, escape characters, and newline conventions, leading to malformed data and parsing issues. Using ASCII control characters as delimiters, like unit and record separators, can simplify data parsing by avoiding conflicts with printable characters. However, there is limited tool support for these delimiters compared to CSVs, which are widely supported despite their fragility.
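The delimiter idea can be sketched in a few lines of Python. This is a minimal illustration, not the article's own code: the helper names `dumps` and `loads` are hypothetical, and the sketch assumes field values never contain the separator characters themselves (it raises an error if they do).

```python
# ASCII control characters used as delimiters.
UNIT_SEP = "\x1f"    # US (unit separator): separates fields within a record
RECORD_SEP = "\x1e"  # RS (record separator): separates records


def dumps(rows):
    """Serialize rows (lists of strings) into one delimiter-separated string."""
    for row in rows:
        for field in row:
            if UNIT_SEP in field or RECORD_SEP in field:
                raise ValueError("field contains a separator character")
    return RECORD_SEP.join(UNIT_SEP.join(row) for row in rows)


def loads(text):
    """Parse a delimiter-separated string back into rows of strings."""
    if not text:
        return []
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


# Commas, quotes, and newlines inside fields need no escaping at all.
rows = [
    ["id", "note"],
    ["1", 'says "hi", then leaves'],
    ["2", "line one\nline two"],
]
encoded = dumps(rows)
decoded = loads(encoded)
print(decoded == rows)
```

Because the separators are non-printable, parsing reduces to two `split` calls; the trade-off, as noted above, is that most editors and spreadsheet tools will not display or produce such files conveniently.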
3
Article
Medium·2y
High-Performance Python Data Processing: pandas 2 vs. Polars, a vCPU Perspective
Polars is emerging as a strong competitor to pandas for Python data analysis, boasting significant performance improvements due to its Rust backend optimized for parallel processing and vectorized operations. This post tests Polars against pandas across varying vCPU counts, finding Polars generally faster, though it encounters some challenges in single-vCPU setups. While Polars shows great promise, considerations like cost, compatibility, and maturity remain important when evaluating a switch from pandas.
4
Article
Towards Dev·2y
Spark — Beyond Basics: Hidden actions in your spark code
The post discusses operations in Apache Spark that look like transformations but actually trigger jobs. It uses Spark code snippets involving `read.csv()`, `df.groupby().pivot()`, and `foreach()` to explain how these operations launch jobs. Key insights include how enabling the `inferSchema` option makes `read.csv()` trigger a job, as an action would, and how `pivot` and `foreach` behave differently from ordinary transformations.
5
Article
DuckDB·2y
DuckDB Tricks – Part 1
This post outlines five useful operations for working with DuckDB, including data shuffling, copying table schemas, specifying CSV data types, updating CSV files in-place, and pretty-printing floating-point numbers. Various SQL snippets demonstrate how to achieve these operations efficiently.