Best of Data Analysis · June 2025

  1. Article
    Machine Learning Mastery · 51w

    10 Python One-Liners That Will Simplify Feature Engineering

    Ten practical Python one-liners for feature engineering tasks including standardization, min-max scaling, polynomial features, one-hot encoding, discretization, logarithmic transformation, ratio creation, low variance removal, multiplicative interactions, and outlier tracking. Each technique uses popular libraries like scikit-learn and pandas to transform raw data into meaningful features for machine learning models.
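
As a rough illustration of three of the transformations listed above (standardization, min-max scaling, and logarithmic transformation), here is a minimal sketch. The article itself uses scikit-learn and pandas; this version swaps in the Python standard library only, so that the underlying arithmetic is visible. The function names and the sample data are hypothetical and are not taken from the article.

```python
import math
import statistics


def standardize(values):
    """Rescale to zero mean and unit variance, using the population
    standard deviation. Assumes the values are not all identical."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]


def min_max_scale(values):
    """Rescale linearly to the [0, 1] range.
    Assumes the values are not all identical."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]


def log_transform(values):
    """Compress a right-skewed feature with log(1 + x).
    Assumes every value is greater than -1."""
    return [math.log1p(v) for v in values]


# Made-up sample feature, for illustration only.
incomes = [20.0, 40.0, 60.0, 80.0]
z_scores = standardize(incomes)
scaled = min_max_scale(incomes)
logged = log_transform(incomes)
```

With a library such as scikit-learn, each of these would typically collapse to a single call on a DataFrame column, which is presumably the sense in which the article calls them one-liners.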

  2. Article
    DuckDB · 50w

    Faster Dashboards with Multi-Column Approximate Sorting

    Advanced multi-column sorting techniques using space filling curves (Morton and Hilbert encodings) and truncated timestamps can significantly improve query performance on columnar data formats. These methods enable approximate sorting across multiple columns simultaneously, allowing diverse dashboard queries to benefit from min-max indexes and row group pruning. Experiments on flight data show Hilbert encoding provides the most consistent performance across different query patterns, while sorting by truncated timestamps (year-level granularity) combined with Hilbert encoding works best for time-filtered queries.
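
To make the Morton (Z-order) encoding mentioned above more concrete, here is a minimal sketch in Python. It is not the article's DuckDB implementation. The function name `interleave_bits` and the sample rows are hypothetical, and the sketch assumes both column values have already been mapped to small non-negative integers.

```python
def interleave_bits(x: int, y: int, bits: int = 16) -> int:
    """Morton (Z-order) code for two non-negative integers.

    Bit i of x is placed at position 2*i and bit i of y at position
    2*i + 1, so both columns contribute to the high-order bits of the key.
    """
    code = 0
    for i in range(bits):
        code |= ((x >> i) & 1) << (2 * i)
        code |= ((y >> i) & 1) << (2 * i + 1)
    return code


# Hypothetical rows of (column_a, column_b), already mapped to integers.
rows = [(3, 0), (0, 3), (1, 1), (2, 2), (0, 0)]

# Sorting by the Morton key orders rows approximately by both columns at once.
rows_sorted = sorted(rows, key=lambda r: interleave_bits(r[0], r[1]))
print(rows_sorted)  # [(0, 0), (1, 1), (3, 0), (0, 3), (2, 2)]
```

Because rows with nearby values in either column tend to receive nearby keys, each row group ends up covering a narrow range of both columns, which is what lets min-max indexes prune row groups for filters on either one. Hilbert encoding follows the same idea with a curve that avoids the large jumps of Z-order, at the cost of a more involved computation.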

  3. Article
    AI · 51w

    High performance real-time knowledge graph open source stack with LLM, Kuzu and CocoIndex

    CocoIndex now supports Kuzu as a target graph database, creating a complete open-source stack for building high-performance knowledge graphs with real-time updates. The integration allows developers to use LLMs to extract relationships from documents and store them in Kuzu in roughly 200 lines of Python code. The framework follows a dataflow programming model in which developers focus on transformations while CocoIndex handles data operations automatically. The stack includes data ingestion, transformation, graph storage, and visualization tools, and lets developers switch between graph databases such as Neo4j and Kuzu.

  4. Article
    DuckDB · 47w

    Discovering DuckDB Use Cases via GitHub

    The DuckDB team demonstrates how to discover and analyze DuckDB usage across GitHub repositories by querying the GitHub API with DuckDB itself. The approach uses DuckDB's HTTP capabilities to fetch repository data, processes the JSON responses with SQL, and automates the workflow with GitHub Actions to generate daily reports in Markdown format. The solution covers pagination handling, data filtering, and visualization of historical trends through Git commit analysis.

  5. Article
    Databricks · 50w

    Introducing Databricks One

    Databricks announces Databricks One, a simplified business intelligence experience designed for non-technical business users. The platform provides secure access to AI-powered dashboards, Genie spaces for conversational data queries, and Databricks Apps through an intuitive interface. Built on Unity Catalog for governance, it enables business teams to access data insights without technical expertise. The full experience enters beta this summer, while consumer access entitlements are available now at no additional cost.

  6. Article
    Laravel News · 50w

    Remove Collection Items Directly with Laravel's forget Method

    Laravel's forget method removes items from collections by their keys while modifying the original collection in place. It accepts single keys or arrays of keys for removal, making it useful for shopping cart management, cleanup operations, and preference handling without creating new collection instances.

  7. Article
    Tigris · 50w

    Get your data ducks in a row with DuckLake

    DuckLake is a new data lakehouse solution that separates metadata storage from data storage, storing metadata in SQL databases (Postgres, MySQL, DuckDB, SQLite) while keeping data in object storage. This architecture enables concurrent writes, eliminates egress fees when using services like Tigris, and allows querying from anywhere. The solution combines relational and non-relational data seamlessly, supports time-travel queries through snapshots, and can scale from laptop development to production workloads without complex infrastructure setup.