Best of DuckDB 2024

  1.
    Article
    DuckDB · 1y

    Analyzing Open Government Data with duckplyr

    duckplyr is a high-performance, drop-in replacement for dplyr in R, powered by DuckDB. This post demonstrates how to use duckplyr to clean and analyze an open dataset from New Zealand's government, showcasing the library's capabilities for data wrangling and analysis. With enhanced CSV parsing and holistic optimization, duckplyr aims to handle large datasets faster and more ergonomically than dplyr.

  2.
    Article
    DuckDB · 1y

    DuckDB Tricks – Part 3

    This post covers DuckDB features and performance techniques for convenient table operations and faster processing of Parquet and CSV files. Using the Dutch railway services dataset, it demonstrates renaming columns with pattern matching, loading data with globbing, reordering Parquet files, and using Hive partitioning to speed up queries.

  3.
    Article
    DuckDB · 2y

    DuckDB Tricks – Part 1

    This post outlines five useful operations for working with DuckDB: shuffling data, copying table schemas, specifying CSV data types, updating CSV files in-place, and pretty-printing floating-point numbers. SQL snippets demonstrate each operation.

  4.
    Article
    DuckDB · 1y

    CSV Files: Dethroning Parquet as the Ultimate Storage File Format — or Not?

    CSV and Parquet serve different purposes in data analytics. CSV files are human-readable and easy to use but are inefficient and hard to parallelize. Parquet files are highly efficient thanks to columnar storage, compression, and a well-defined schema, which makes them better suited for data analysis. DuckDB has recently improved its CSV reader, making it more efficient and easier to use, but Parquet still holds a performance edge, especially in query execution. The article concludes that while CSV files have their place for flexibility, Parquet files remain superior for most analytical tasks.