Best of Data Science — September 2025
- 1
- 2
Hugging Face·37w
Jupyter Agents: training LLMs to reason with notebooks
Hugging Face developed Jupyter Agent, a system that trains small language models to perform data science tasks by executing code in Jupyter notebooks. The pipeline starts with 2TB of Kaggle notebooks, applies deduplication and quality filtering, generates synthetic question-answer pairs, and fine-tunes Qwen3-4B models. The approach achieved 75% accuracy on easy DABStep benchmark tasks, demonstrating that smaller models can become effective data science agents given the right training data and scaffolding. The project includes open-source datasets, trained models, and a simplified 200-line scaffolding system.
- 3
The Palindrome·37w
Correlation vs. cosine similarity
Explores the key differences between Pearson correlation and cosine similarity, two statistical measures for quantifying relationships between variables. While both are based on dot products, correlation performs double normalization (mean-centering and variance scaling) while cosine similarity only normalizes by magnitude. Through mathematical explanations and Python simulations, the post demonstrates that these measures can yield dramatically different results depending on data scaling and offsets. Correlation is recommended when measurement units are arbitrary or different, while cosine similarity is preferred when variables share meaningful units, particularly in machine learning applications with vector embeddings.
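To make the distinction concrete, here is a minimal sketch (not taken from the post itself) that computes both measures with the standard library only. The function names and sample vectors are illustrative assumptions. It relies on the standard identity that Pearson correlation equals the cosine similarity of the mean-centered vectors.

```python
import math


def cosine_similarity(x, y):
    """Dot product divided by the product of the vector magnitudes."""
    dot = sum(a * b for a, b in zip(x, y))
    norm_x = math.sqrt(sum(a * a for a in x))
    norm_y = math.sqrt(sum(b * b for b in y))
    return dot / (norm_x * norm_y)


def pearson_correlation(x, y):
    """Cosine similarity of the mean-centered vectors."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    centered_x = [a - mean_x for a in x]
    centered_y = [b - mean_y for b in y]
    return cosine_similarity(centered_x, centered_y)


# Illustrative data: y decreases exactly as x increases.
x = [1.0, 2.0, 3.0, 4.0]
y = [4.0, 3.0, 2.0, 1.0]

# Adding a constant offset changes cosine similarity but not correlation.
x_shifted = [a + 100.0 for a in x]
y_shifted = [b + 100.0 for b in y]

print("cosine      :", cosine_similarity(x, y))
print("correlation :", pearson_correlation(x, y))
print("cosine (shifted)      :", cosine_similarity(x_shifted, y_shifted))
print("correlation (shifted) :", pearson_correlation(x_shifted, y_shifted))
```

On this data the correlation is -1 with or without the offset, while cosine similarity is positive and moves toward 1 as the offset grows. This is the sensitivity to offsets that the summary describes.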
- 4
Planet Python·37w
Python Memory Tricks to Boost Performance
Comprehensive guide covering practical Python memory optimization techniques: generators for lazy loading, __slots__ for reducing per-object overhead, weak references for cache management, string interning, smart data structure choices, chunked file processing, and Python 3.13's mimalloc allocator. Includes ready-to-use code examples and memory profiling tools; the guide claims these techniques can reduce RAM usage by 40-60% in large applications.
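As a rough illustration of two of the techniques listed above, the sketch below contrasts a list with a generator and a regular class with a __slots__ class. It is not code from the guide. The class names are hypothetical, and the sizes printed depend on the Python version and platform.

```python
import sys


class PointDict:
    """Regular class: each instance carries a per-instance __dict__."""

    def __init__(self, x, y):
        self.x = x
        self.y = y


class PointSlots:
    """__slots__ class: fixed attribute storage, no per-instance __dict__."""

    __slots__ = ("x", "y")

    def __init__(self, x, y):
        self.x = x
        self.y = y


# A list materializes every element up front; a generator yields them lazily.
squares_list = [n * n for n in range(100_000)]
squares_gen = (n * n for n in range(100_000))

list_size = sys.getsizeof(squares_list)  # size of the list object itself
gen_size = sys.getsizeof(squares_gen)    # size of the generator object

print("list object bytes     :", list_size)
print("generator object bytes:", gen_size)

# The generator can still be consumed once, for example by sum().
total = sum(squares_gen)
print("sum of squares        :", total)

# Only the regular class has an instance __dict__.
print("PointDict has __dict__ :", hasattr(PointDict(1, 2), "__dict__"))
print("PointSlots has __dict__:", hasattr(PointSlots(1, 2), "__dict__"))
```

Note that sys.getsizeof reports only the container's own size, not the objects it references, so a profiler such as tracemalloc is a better choice for measuring real savings in an application.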