Best of Data Science · September 2025

  1.
    Article
    Vincent D. Warmerdam · 35w

    python data tools live

    A brief personal post sharing an inside joke from the author's office related to Python data tools, mentioned in the context of an upcoming team offsite.

  2.
    Article
    Hugging Face · 37w

    Jupyter Agents: training LLMs to reason with notebooks

    Hugging Face developed Jupyter Agent, a system that trains small language models to perform data science tasks by executing code in Jupyter notebooks. The pipeline starts with 2TB of Kaggle notebooks, applies deduplication and quality filtering, generates synthetic question-answer pairs, and fine-tunes Qwen3-4B models. The approach achieved 75% accuracy on easy DABStep benchmark tasks, showing that smaller models can become effective data science agents given suitable training data and scaffolding. The project includes open-source datasets, trained models, and a simplified 200-line scaffolding system.

  3.
    Article
    The Palindrome · 37w

    Correlation vs. cosine similarity

    Explores the key differences between Pearson correlation and cosine similarity, two statistical measures for quantifying relationships between variables. While both are based on dot products, correlation performs double normalization (mean-centering and variance scaling) while cosine similarity only normalizes by magnitude. Through mathematical explanations and Python simulations, the post demonstrates that these measures can yield dramatically different results depending on data scaling and offsets. Correlation is recommended when measurement units are arbitrary or different, while cosine similarity is preferred when variables share meaningful units, particularly in machine learning applications with vector embeddings.
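
The distinction described above can be illustrated with a minimal sketch. It is not the post's own code: the function names and sample vectors are made up for illustration, and it assumes plain Python lists of equal length with non-zero variance. It treats Pearson correlation as cosine similarity applied to mean-centered vectors, which is why adding a constant offset changes one measure but not the other.

```python
import math


def cosine_similarity(x, y):
    """Dot product normalized by the magnitudes of both vectors."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """Cosine similarity of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_similarity(centered_x, centered_y)


# Illustrative data (hypothetical, chosen only to make the effect visible).
x = [1.0, 2.0, 3.0, 4.0]
y = [2.0, 1.0, 4.0, 3.0]

# The same data with a large constant offset added to every element.
shifted_x = [a + 100.0 for a in x]
shifted_y = [b + 100.0 for b in y]

print("original: cosine =", cosine_similarity(x, y),
      "pearson =", pearson_correlation(x, y))
print("shifted:  cosine =", cosine_similarity(shifted_x, shifted_y),
      "pearson =", pearson_correlation(shifted_x, shifted_y))
```

With the offset applied, the cosine similarity moves close to 1 because both vectors now point in nearly the same direction, while the correlation is unchanged because centering removes the offset.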

  4.
    Article
    Planet Python · 37w

    Python Memory Tricks to Boost Performance

    A guide to practical Python memory optimization techniques: generators for lazy loading, __slots__ for reducing per-object overhead, weak references for cache management, string interning, smart data structure choices, chunked file processing, and Python 3.13's mimalloc. Includes ready-to-use code examples and memory profiling tools, with the stated aim of helping developers reduce RAM usage by 40-60% in large applications.
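
Two of the techniques listed above, generators and __slots__, can be sketched briefly. This is an illustrative example rather than code from the guide: the class and function names are hypothetical, and `sys.getsizeof` reports only the shallow size of a container, so the numbers are a rough indication and not a full memory profile.

```python
import sys


class PointDict:
    """Regular class: each instance carries a per-instance __dict__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    """Class with __slots__: fixed attributes, no per-instance __dict__."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


def squares(n):
    """Yield squares one at a time instead of building a full list."""
    for i in range(n):
        yield i * i


# A list materializes every element up front; a generator holds only its state.
squares_list = [i * i for i in range(100_000)]
squares_gen = squares(100_000)

list_size = sys.getsizeof(squares_list)  # shallow size of the list object
gen_size = sys.getsizeof(squares_gen)    # size of the generator object

print("list bytes:", list_size)
print("generator bytes:", gen_size)

# Slotted instances have no __dict__, which is where the savings come from.
print("PointDict has __dict__:", hasattr(PointDict(1, 2), "__dict__"))
print("PointSlots has __dict__:", hasattr(PointSlots(1, 2), "__dict__"))
```

The trade-off with __slots__ is that instances cannot gain new attributes at runtime, so it fits classes with a fixed, known set of fields that are created in large numbers.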