Best of Data Analysis — June 2025

1
Article
Machine Learning Mastery·51w
10 Python One-Liners That Will Simplify Feature Engineering
Ten practical Python one-liners for feature engineering tasks: standardization, min-max scaling, polynomial features, one-hot encoding, discretization, logarithmic transformation, ratio creation, low-variance feature removal, multiplicative interactions, and outlier tracking. Each technique uses popular libraries such as scikit-learn and pandas to transform raw data into meaningful features for machine learning models.
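As a rough illustration of four of the techniques listed above (standardization, min-max scaling, logarithmic transformation, and ratio creation), here is a minimal sketch. It deliberately uses only the Python standard library rather than the scikit-learn and pandas calls the article covers, and the `incomes` and `debts` columns are made-up values, not data from the article.

```python
import math
import statistics

# Hypothetical toy columns; the values are invented for illustration.
incomes = [32000.0, 45000.0, 58000.0, 120000.0]
debts = [8000.0, 9000.0, 29000.0, 30000.0]

# Standardization: subtract the mean, divide by the population standard deviation.
mean, std = statistics.fmean(incomes), statistics.pstdev(incomes)
standardized = [(x - mean) / std for x in incomes]

# Min-max scaling: map the column onto the interval [0, 1].
lo, hi = min(incomes), max(incomes)
scaled = [(x - lo) / (hi - lo) for x in incomes]

# Logarithmic transformation: log(1 + x) compresses the long right tail.
logged = [math.log1p(x) for x in incomes]

# Ratio creation: combine two columns into one debt-to-income feature.
ratio = [d / i for d, i in zip(debts, incomes)]
```

Each transformation is a single list comprehension, which mirrors the one-liner spirit of the article; with a DataFrame the same arithmetic would typically be applied column-wise.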
2
Article
DuckDB·50w
Faster Dashboards with Multi-Column Approximate Sorting
Advanced multi-column sorting techniques using space-filling curves (Morton and Hilbert encodings) and truncated timestamps can significantly improve query performance on columnar data formats. These methods sort approximately by several columns at once, so diverse dashboard queries can benefit from min-max indexes and row group pruning. Experiments on flight data show that Hilbert encoding provides the most consistent performance across different query patterns, while sorting by truncated timestamps (year-level granularity) combined with Hilbert encoding works best for time-filtered queries.
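To make the Morton (Z-order) idea concrete, the sketch below interleaves the bits of two integer columns into one sort key, so rows that are close in both columns tend to end up near each other. This is an illustrative Python sketch, not the SQL or functions used in the DuckDB article; `morton_encode` and the `rows` sample are hypothetical names and data.

```python
def morton_encode(x: int, y: int, bits: int = 16) -> int:
    """Interleave the low `bits` bits of x and y into one Z-order key."""
    key = 0
    for i in range(bits):
        key |= ((x >> i) & 1) << (2 * i)      # bits of x land on even positions
        key |= ((y >> i) & 1) << (2 * i + 1)  # bits of y land on odd positions
    return key


# Hypothetical rows: two columns already mapped to small non-negative integers.
rows = [(3, 0), (0, 0), (1, 1), (0, 1), (2, 2), (1, 0)]

# Sorting by the interleaved key orders rows approximately by both columns.
rows_sorted = sorted(rows, key=lambda r: morton_encode(r[0], r[1]))
```

Because neither column dominates the key, a filter on either one touches a narrower range of the sorted data than it would under a strict sort on the other column, which is what lets min-max indexes skip row groups.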
3
Article
AI·51w
High performance real-time knowledge graph open source stack with LLM, Kuzu and CocoIndex
CocoIndex now supports Kuzu as a target graph database, creating a complete open-source stack for building high-performance knowledge graphs with real-time updates. The integration lets developers use LLMs to extract relationships from documents and store them in Kuzu in roughly 200 lines of Python code. The framework follows a dataflow programming model in which developers focus on transformations while CocoIndex handles data operations automatically. The stack includes data ingestion, transformation, graph storage, and visualization tools, and supports switching between graph databases such as Neo4j and Kuzu.
4
Article
DuckDB·47w
Discovering DuckDB Use Cases via GitHub
The DuckDB team demonstrates how to discover and analyze DuckDB usage across GitHub repositories by querying the GitHub API with DuckDB itself. The approach uses DuckDB's HTTP capabilities to fetch repository data, processes the JSON responses with SQL, and automates the workflow with GitHub Actions to generate daily reports in Markdown format. The solution includes pagination handling, data filtering, and visualization of historical trends through Git commit analysis.
5
Article
databricks·50w
Introducing Databricks One
Databricks announces Databricks One, a simplified business intelligence experience designed for non-technical business users. The platform provides secure access to AI-powered dashboards, Genie spaces for conversational data queries, and Databricks Apps through an intuitive interface. Built on Unity Catalog for governance, it enables business teams to access data insights without technical expertise. The full experience enters beta this summer, while consumer access entitlements are available now at no additional cost.
6
Article
Laravel News·50w
Remove Collection Items Directly with Laravel's forget Method
Laravel's forget method removes items from collections by their keys while modifying the original collection in place. It accepts single keys or arrays of keys for removal, making it useful for shopping cart management, cleanup operations, and preference handling without creating new collection instances.
7
Article
Tigris·50w
Get your data ducks in a row with DuckLake
DuckLake is a new data lakehouse solution that separates metadata storage from data storage, storing metadata in SQL databases (Postgres, MySQL, DuckDB, SQLite) while keeping data in object storage. This architecture enables concurrent writes, eliminates egress fees when using services like Tigris, and allows querying from anywhere. The solution combines relational and non-relational data seamlessly, supports time-travel queries through snapshots, and can scale from laptop development to production workloads without complex infrastructure setup.
