Best of DuckDB 2025

  1. Article
    Groww Engineering · 25w

    When Two Databases Become One: How DuckDB Saved Our Trading Operations from Manual Reconciliation

    A trading platform faced recurring position-order mismatches across two separate MySQL databases, requiring 20-30 minutes of manual reconciliation by two engineers. By leveraging DuckDB's MySQL scanner extension to perform cross-database joins, they automated the entire process into a 2-3 minute operation running every 15 minutes. The solution eliminated manual intervention, improved accuracy from 85% to 99.9%, and enabled proactive monitoring instead of reactive fixes during market hours.
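The cross-database join the article describes can be sketched with DuckDB's `mysql` extension. A minimal, hypothetical example — connection strings, table names, and columns are illustrative, not taken from the article:

```sql
-- Attach both MySQL databases read-only, then reconcile in a single query.
INSTALL mysql;
LOAD mysql;

ATTACH 'host=db1.internal user=readonly database=positions' AS pos_db (TYPE mysql, READ_ONLY);
ATTACH 'host=db2.internal user=readonly database=orders'    AS ord_db (TYPE mysql, READ_ONLY);

-- Positions whose quantity disagrees with the filled order quantity.
SELECT p.account_id, p.symbol, p.quantity, o.filled_qty
FROM pos_db.positions AS p
JOIN ord_db.orders    AS o USING (account_id, symbol)
WHERE p.quantity <> o.filled_qty;
```

A query like this can be scheduled (e.g. every 15 minutes) to flag mismatches automatically instead of reconciling by hand.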

  2. Article
    DuckDB · 1y

    The DuckDB Local UI

    DuckDB, in collaboration with MotherDuck, has introduced a built-in local UI available starting from DuckDB v1.2.1. This UI can be launched via terminal or a SQL command and offers features such as interactive notebooks, a column explorer, and detailed table summaries. It runs all queries locally, ensuring data privacy unless explicitly connected to MotherDuck. The UI is designed to be simple, fast, feature-rich, and fully open source.
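Per the DuckDB documentation, the UI can be started either from the shell or from a running session:

```sql
-- From the command line: duckdb -ui
-- Or from an existing DuckDB session (v1.2.1+):
CALL start_ui();
```

Both routes open the UI in the local browser; queries continue to execute in the local DuckDB process.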

  3. Article
    MotherDuck · 20w

    Stop Paying the Complexity Tax

    Most organizations don't need massive distributed data systems. The industry has over-engineered solutions for edge cases, forcing everyone to pay a complexity tax for scale they'll never require. Modern single-machine databases can handle what previously required distributed systems, with machines now offering 192 cores and 1.5TB of memory. By separating storage (cheap, infinite object storage) from compute (ephemeral, cloneable instances), and designing for the common case of small data with occasional big compute needs, teams can achieve better performance with dramatically simpler architecture. DuckDB exemplifies this approach by focusing on the complete user experience, not just query performance, while MotherDuck extends it with cloud durability and per-user isolation through individual database instances that spin up in under 100ms.

  4. Article
    DuckDB · 44w

    DuckLake 0.2

    DuckLake 0.2 introduces significant improvements including secrets management for credentials, enhanced Parquet file settings, relative schema/table paths for better organization, name mapping for existing Parquet files, scoped settings at schema and table levels, and partition transforms. The update includes automatic migration from v0.1 and adds new functions like ducklake_list_files for better system integration.

  5. Video
    Fireship · 39w

    DuckDB in 100 Seconds

    DuckDB is an open-source, embeddable SQL database optimized for analytical workloads through columnar storage. Unlike SQLite's row-based approach, DuckDB stores data column-wise, enabling faster aggregations, filters, and joins on large datasets. It features vectorized query execution, multi-threading, and can directly query CSV and Parquet files. The database excels at time series analysis and is already used by major companies like Meta, Google, and Airbnb.
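The direct file querying mentioned above looks like this in practice (file names are illustrative):

```sql
-- Query a Parquet file directly, no import step:
SELECT passenger_count, avg(fare_amount)
FROM 'taxi_2024.parquet'
GROUP BY passenger_count;

-- CSV files work the same way:
SELECT count(*) FROM 'events.csv';
```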

  6. Article
    DuckDB · 34w

    Announcing DuckDB 1.4.0

DuckDB 1.4.0 'Andium' introduces Long Term Support with one year of community maintenance, database encryption using AES-256, a MERGE statement for upsert operations, Iceberg write support, a CLI progress bar with ETA, the FILL window function for interpolating missing values, and performance improvements including a sorting rework and materialized CTEs. The release also adds macOS notarization and moves the Python integration to a separate repository.
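A minimal sketch of the new MERGE statement, with hypothetical table and column names:

```sql
-- Upsert: update quantities for existing SKUs, insert rows for new ones.
MERGE INTO inventory AS t
USING new_stock AS s
ON t.sku = s.sku
WHEN MATCHED THEN UPDATE SET quantity = t.quantity + s.quantity
WHEN NOT MATCHED THEN INSERT (sku, quantity) VALUES (s.sku, s.quantity);
```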

  7. Article
    MotherDuck · 1y

    Instant SQL is here: Speedrun ad-hoc queries as you type

Instant SQL is a new feature in MotherDuck and the DuckDB Local UI that previews query results in real time as you type, speeding up building and debugging SQL queries. It is designed to keep you in an analytical flow state: data can be visualized and modified immediately, shortening the draft-and-debug cycle. Instant SQL supports various data sources and includes AI-powered inline edit suggestions.

  8. Article
    DuckDB · 48w

    Faster Dashboards with Multi-Column Approximate Sorting

    Advanced multi-column sorting techniques using space filling curves (Morton and Hilbert encodings) and truncated timestamps can significantly improve query performance on columnar data formats. These methods enable approximate sorting across multiple columns simultaneously, allowing diverse dashboard queries to benefit from min-max indexes and row group pruning. Experiments on flight data show Hilbert encoding provides the most consistent performance across different query patterns, while sorting by truncated timestamps (year-level granularity) combined with Hilbert encoding works best for time-filtered queries.
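One way to approximate this in DuckDB is a Hilbert key from the spatial extension combined with a truncated timestamp in the sort order. A hedged sketch — table, columns, and the exact encoding in the article may differ:

```sql
INSTALL spatial;
LOAD spatial;

-- Write Parquet approximately clustered by year, then by a Hilbert curve
-- over (lon, lat), so min-max indexes help both time and spatial filters.
COPY (
    SELECT *
    FROM flights
    ORDER BY
        date_trunc('year', scheduled_dep),
        ST_Hilbert(lon, lat, ST_Extent(ST_MakeEnvelope(-180, -90, 180, 90)))
) TO 'flights_sorted.parquet' (FORMAT parquet);
```

Row groups in the resulting file then carry tight min-max ranges on several columns at once, enabling pruning for diverse dashboard queries.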

  9. Article
    DuckDB · 22w

    Announcing DuckDB 1.4.3 LTS

    DuckDB 1.4.3 LTS is now available with important bugfixes addressing correctness issues in HAVING clauses, JOIN operations, and indexed table updates. The release introduces beta support for Windows ARM64, including native extension distribution and Python wheels via PyPI. Benchmarks on TPC-H SF100 show 24% performance improvement for native ARM64 compared to emulated AMD64 on Snapdragon-based systems. Additional fixes include race condition crashes, memory management improvements during WAL replay, and various edge cases in Unicode handling and Parquet exports.

  10. Article
    Towards Dev · 1y

Building an End-to-End Data Lakehouse with Medallion Architecture, Airflow, and DuckDB

Learn how to build an end-to-end data lakehouse using the Medallion architecture, Apache Airflow, and DuckDB. Understand the roles of the Bronze, Silver, and Gold layers in managing data quality and transformation. Discover why Apache Airflow is ideal for orchestrating workflows and how DuckDB serves as a high-performance analytical database for data warehousing.

  11. Article
    MotherDuck · 42w

    Summer Data Engineering Roadmap

    A comprehensive 3-week structured learning roadmap for aspiring data engineers covering foundational skills (SQL, Git, Linux), core engineering concepts (Python, cloud platforms, data modeling), and advanced topics (streaming, data quality, DevOps). The guide provides curated resources and a progressive learning path from beginner to intermediate level, emphasizing practical skills needed for full-stack data engineering roles.

  12. Article
    DuckDB · 51w

    Announcing DuckDB 1.3.0

    DuckDB version 1.3.0 introduces several new features and improvements, including external file caching, direct query capabilities via the CLI, Python-style lambda syntax, support for UUID v7, expression support in CREATE SECRET, and improved spatial join efficiency. The release also makes internal changes to enhance performance and reliability, particularly for Parquet file handling and string compression.
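Two of the 1.3.0 additions in miniature — the list contents here are illustrative:

```sql
-- UUID v7 generation (time-ordered UUIDs):
SELECT uuidv7();

-- Python-style lambda syntax, alongside the existing arrow form:
SELECT list_transform([1, 2, 3], lambda x: x + 1);  -- equivalent to x -> x + 1
```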

  13. Article
    Hacker News · 45w

    sirius-db/sirius

    Sirius is a GPU-native SQL engine that integrates with existing databases like DuckDB through the Substrait query format. It delivers approximately 10x performance improvements over CPU-based query engines on TPC-H benchmarks while maintaining the same hardware costs. The system supports NVIDIA GPUs with compute capability 7.0+ and CUDA 11.2+, offering deployment options through AWS AMIs, Docker images, or manual installation. Sirius handles common SQL operations including filtering, joins, aggregations, and ordering, though it currently has limitations around data size constraints, row count limits, and partial NULL column support.

  14. Article
    DuckDB · 26w

    Announcing DuckDB 1.4.2 LTS

    DuckDB 1.4.2 LTS is now available with critical security fixes for database encryption vulnerabilities, new Iceberg extension support for insert/update/delete operations, enhanced logging and profiling capabilities including HTTP request timing, and Vortex file format support. The release also includes performance optimizations for WAL index operations and database detachment, plus fixes for crashes, incorrect results, and storage issues.

  15. Article
    Community Picks · 1y

    DuckDB Database File as a New Standard for Sharing Data?

    DuckDB offers a simplified approach to data sharing by encapsulating multiple tables into a single database file. This reduces compatibility issues and eliminates the need for packaging files into tar/zip archives. Tests showed that DuckDB handles numerical data more efficiently than PostgreSQL, while string data storage initially appeared less efficient but improved with larger datasets.
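The sharing workflow amounts to attaching a database file and creating tables in it. A small sketch with hypothetical file names:

```sql
-- Producer: bundle several tables into one shareable file.
ATTACH 'dataset.duckdb' AS ds;
CREATE TABLE ds.customers AS FROM 'customers.csv';
CREATE TABLE ds.orders    AS FROM 'orders.parquet';
DETACH ds;

-- Consumer: attach the single file read-only and query it.
ATTACH 'dataset.duckdb' AS ds (READ_ONLY);
SELECT count(*) FROM ds.orders;
```

One file replaces a tar/zip of CSVs, and the schema (types, column names) travels with the data.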

  16. Article
    dltHub · 47w

    Building Engine-Agnostic Data Stacks

    Modern data teams often use multiple engines like Spark, DuckDB, and Snowflake, but struggle with data portability and code reusability across platforms. Apache Iceberg solves the storage problem by enabling safe data sharing between engines through ACID transactions and multi-engine coordination. Tools like Ibis complement this by providing engine-agnostic analytical code that runs on any supported backend without modification. Together, these technologies create truly portable data stacks where both data and business logic are decoupled from specific compute engines, reducing vendor lock-in and integration overhead.

  17. Article
    MotherDuck · 36w

    Announcing Pg_duckdb Version 1.0

    Pg_duckdb version 1.0 is now available, bringing DuckDB's vectorized analytical engine directly into PostgreSQL as an extension. This integration enables faster analytical queries on PostgreSQL data without requiring separate data warehouses or complex ETL processes. The extension allows querying PostgreSQL tables with DuckDB's performance benefits, accessing external data lake files (Parquet, CSV, JSON), and joining local PostgreSQL data with remote cloud storage files in single queries. Performance improvements show up to 4x speedup with indexes and dramatic improvements for queries that previously timed out. The release includes enhanced MotherDuck integration for serverless analytics scaling.
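A hedged sketch of the usage pattern inside a PostgreSQL session — the bucket path and table are hypothetical, and calling conventions can vary across pg_duckdb versions:

```sql
CREATE EXTENSION pg_duckdb;

-- Route analytical queries on regular Postgres tables through DuckDB's
-- vectorized engine:
SET duckdb.force_execution = true;
SELECT region, sum(amount) FROM orders GROUP BY region;

-- Read a data lake file directly from object storage:
SELECT count(*) FROM read_parquet('s3://my-bucket/events/*.parquet');
```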

  18. Article
    DuckDB · 45w

    Discovering DuckDB Use Cases via GitHub

    DuckDB team demonstrates how to discover and analyze DuckDB usage across GitHub repositories by querying the GitHub API with DuckDB itself. The approach involves using DuckDB's HTTP capabilities to fetch repository data, processing JSON responses with SQL, and automating the workflow with GitHub Actions to generate daily reports in Markdown format. The solution includes pagination handling, data filtering, and visualization of historical trends through Git commit analysis.
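The core trick — querying the GitHub API with DuckDB itself — can be sketched as follows; the endpoint and fields are illustrative, and the `httpfs` extension is autoloaded for HTTP reads:

```sql
-- Fetch a page of search results and unnest the JSON 'items' array.
SELECT item.full_name, item.stargazers_count
FROM (
    SELECT unnest(items) AS item
    FROM read_json('https://api.github.com/search/repositories?q=duckdb&per_page=5')
)
ORDER BY item.stargazers_count DESC;
```

Wrapping a query like this in a GitHub Actions workflow yields the daily Markdown reports the article describes.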

  19. Article
    Debezium · 1y

    Real-time Data Replication with Debezium and Python

    Change Data Capture (CDC) is essential for replicating operational data for analytics, and Debezium is a leading tool in this space, connecting to various databases and exporting CDC events in formats like JSON and Avro. This post demonstrates how to implement a Python-powered CDC pipeline using Debezium and pydbzengine, capturing change data from PostgreSQL and loading it into DuckDB with the Data Load Tool (DLT). The guide includes a code walkthrough, from setting up the environment and configuring Debezium to executing the pipeline and querying the results in DuckDB.

  20. Article
    Tigris · 48w

    Get your data ducks in a row with DuckLake

    DuckLake is a new data lakehouse solution that separates metadata storage from data storage, storing metadata in SQL databases (Postgres, MySQL, DuckDB, SQLite) while keeping data in object storage. This architecture enables concurrent writes, eliminates egress fees when using services like Tigris, and allows querying from anywhere. The solution combines relational and non-relational data seamlessly, supports time-travel queries through snapshots, and can scale from laptop development to production workloads without complex infrastructure setup.
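The metadata/data split looks like this in DuckDB, per the DuckLake docs — the Postgres connection string and bucket path below are illustrative:

```sql
INSTALL ducklake;
LOAD ducklake;

-- Metadata lives in Postgres; data files land in object storage.
ATTACH 'ducklake:postgres:dbname=lake_meta host=pg.internal' AS lake
    (DATA_PATH 's3://my-bucket/lake/');

CREATE TABLE lake.trips AS FROM 'trips.parquet';
SELECT count(*) FROM lake.trips;
```

Swapping `postgres:` for `sqlite:` or a plain DuckDB file changes only the metadata store, which is what lets the same lake scale from laptop to production.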

  21. Article
    DuckDB · 52w

    Machine Learning Prototyping with DuckDB and scikit-learn

This post shows how DuckDB, an efficient data management system, complements scikit-learn, a popular machine learning library, in developing a species prediction model on the Palmer Penguins dataset. Key steps include data preprocessing with DuckDB, training a Random Forest classifier, and three inference methods: Pandas, a row-by-row DuckDB UDF, and DuckDB batch-style inference. The post discusses the performance implications of UDFs, which remain useful despite slower execution than Pandas.

  22. Article
    DuckDB · 1y

    Parquet Bloom Filters in DuckDB

    DuckDB now supports reading and writing Parquet Bloom filters, which help in selectively reading relevant data for queries by using compact index structures. The new feature is transparent to users and significantly improves query performance, especially in scenarios with large Parquet files or slow network connections. Bloom filters are supported for various data types, including integers, floating points, and strings, but not yet for nested types.
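Writing filters is transparent, and the blog post also describes a probe function for inspecting them. A hedged sketch with hypothetical file and column names:

```sql
-- Bloom filters are written automatically for supported column types:
COPY events TO 'events.parquet' (FORMAT parquet);

-- Check which row groups could contain a given value:
SELECT * FROM parquet_bloom_probe('events.parquet', 'user_id', 42);
```

Row groups the filter excludes can be skipped entirely on read, which is where the speedup on large or remote files comes from.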