Best of Data Science, June 2025

  1.
    Article
    Daily Dose of Data Science | Avi Chawla | Substack · 51w

    48 Most Popular Open ML Datasets

    A comprehensive compilation of 48 widely-used open machine learning datasets organized by domain including computer vision (ImageNet, COCO), natural language processing (SQuAD, GLUE), recommendation systems (MovieLens, new Yambda-5B), tabular data (UCI datasets, Titanic), reinforcement learning (OpenAI Gym), and multimodal learning (LAION-5B, VQA). Each dataset is briefly described with its primary use case and key characteristics, serving as a reference guide for researchers and practitioners selecting appropriate datasets for their ML projects.

  2.
    Article
    Learn Python · 50w

    Data Science Roadmap

  3.
    Video
    bycloud · 49w

    1-Bit LLM: The Most Efficient LLM Possible?

    BitNet introduces 1-bit quantization for large language models, reducing memory usage by up to 7 times and energy consumption by 12 times compared to full-precision models. The technique uses ternary weights (-1, 0, 1), roughly 1.58 bits per weight, instead of traditional 16-bit floating-point numbers, so matrix operations reduce to simple addition and subtraction. Recent advances include BitNet b1.58 with sparsity support and BitNet a4.8 with 4-bit activations and a 3-bit KV cache, allowing 5x larger context windows. A 2B-parameter BitNet model achieves performance comparable to much larger models while requiring only a 0.44GB memory footprint and costing around $1.3K to train versus $26K for traditional approaches.
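To make the "addition and subtraction only" point concrete, here is a minimal pure-Python sketch. It assumes an absmean-style rounding scheme (scale by the mean absolute weight, round, clip to -1, 0, 1); the function names `ternary_quantize` and `ternary_matvec` are hypothetical and are not taken from the video or from any BitNet codebase.

```python
def ternary_quantize(weights):
    """Round each weight to -1, 0, or 1 after dividing by the mean absolute weight.

    Assumption: an absmean-style scheme; real implementations may differ in detail.
    """
    magnitudes = [abs(w) for row in weights for w in row]
    scale = (sum(magnitudes) / len(magnitudes)) or 1.0  # avoid dividing by zero
    quantized = [[max(-1, min(1, round(w / scale))) for w in row] for row in weights]
    return quantized, scale


def ternary_matvec(quantized, scale, x):
    """Approximate W @ x using only additions and subtractions in the inner loop."""
    out = []
    for row in quantized:
        acc = 0.0
        for w, xi in zip(row, x):
            if w == 1:
                acc += xi
            elif w == -1:
                acc -= xi
            # w == 0 contributes nothing, which is where sparsity comes from
        out.append(acc * scale)  # one multiplication per output row
    return out


weights = [[0.9, -0.05, -1.2], [0.1, 0.8, -0.7]]
q, scale = ternary_quantize(weights)
y = ternary_matvec(q, scale, [2.0, 3.0, 1.0])
```

The only multiplication left is the single rescale per output row; everything inside the inner loop is an add, a subtract, or a skip.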

  4.
    Article
    The Palindrome · 50w

    The Anatomy of Logistic Regression

    Logistic regression transforms geometric relationships into probability predictions through a step-by-step process. Starting with a linear transformation (ax + b) that produces the logit, the model applies the sigmoid function, which is built from exponentials, to map any real number to a probability between 0 and 1. The geometric aspect becomes clear in higher dimensions, where the decision boundary forms a line or plane and the logit is proportional to the signed distance from that boundary. This fundamental approach demonstrates how machine learning models convert spatial relationships into probabilistic predictions.
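The pipeline described above (linear map, then sigmoid) can be sketched in a few lines of standard-library Python. The weights, bias, and function names below are illustrative assumptions, not values from the article.

```python
import math


def sigmoid(z):
    """Map a real-valued logit to a probability in (0, 1), stable for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def logit(weights, bias, x):
    """Linear step: w . x + b."""
    return sum(w * xi for w, xi in zip(weights, x)) + bias


def predict_proba(weights, bias, x):
    """Probability of the positive class for input x."""
    return sigmoid(logit(weights, bias, x))


def signed_distance(weights, bias, x):
    """Signed distance from x to the boundary w . x + b = 0 (logit divided by ||w||)."""
    norm = math.sqrt(sum(w * w for w in weights))
    return logit(weights, bias, x) / norm


# Hypothetical 2-D model: the boundary is the line 3*x1 + 4*x2 - 5 = 0.
w, b = [3.0, 4.0], -5.0
on_boundary = predict_proba(w, b, [1.0, 0.5])  # logit is 0, so probability is 0.5
far_side = signed_distance(w, b, [3.0, 4.0])   # logit 20, ||w|| = 5
```

A point on the boundary gets probability exactly 0.5, and dividing the logit by the weight norm recovers the geometric distance.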

  5.
    Article
    80 LEVEL · 50w

    Borderlands 2 Goes Free on Steam Amidst Intense Review Bombing

    Borderlands 2 became free on Steam but faced massive review bombing due to controversial Terms of Service updates by Take-Two and 2K. The new EULA introduces extensive data collection practices, bans mods, includes forced arbitration clauses, and restricts VPN usage. Players are protesting these privacy-invasive changes, resulting in only 18% positive recent reviews despite the free game offer.

  6.
    Article
    xkcd · 49w

    xkcd: Tukey

    An xkcd comic referencing John Tukey, the influential statistician known for developing numerous statistical methods and concepts, including the box plot, the Cooley-Tukey FFT algorithm, and exploratory data analysis techniques.

  7.
    Article
    Community Picks · 49w

    Forms that do it all

    Fillout is a form builder platform that enables users to create customizable forms, surveys, and quizzes with drag-and-drop functionality. The service offers 50+ field types, intelligent routing for automation, real-time collaboration features, and analytics. It provides a free tier with unlimited forms, 1000 monthly submissions, and unlimited team seats.

  8.
    Article
    Hugging Face · 49w

    (LoRA) Fine-Tuning FLUX.1-dev on Consumer Hardware

    QLoRA enables fine-tuning of the FLUX.1-dev diffusion model on consumer hardware with under 10GB of VRAM by combining 4-bit quantization with Low-Rank Adaptation. The approach uses bitsandbytes for quantization, an 8-bit AdamW optimizer, gradient checkpointing, and cached latents to reduce memory usage from ~120GB to ~9GB. Training on an RTX 4090 takes 41 minutes for 700 steps, while FP8 training with torchao on an H100 reduces the time to 20 minutes. The technique maintains high-quality results while making advanced model customization accessible to developers without enterprise-grade hardware.
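As a rough illustration of why Low-Rank Adaptation saves memory, the sketch below implements the standard LoRA forward pass, y = W x + (alpha / r) B (A x), in plain Python. It deliberately leaves out quantization, bitsandbytes, and FLUX itself; the layer sizes, rank, and function names are hypothetical and chosen only to show the arithmetic.

```python
def matvec(m, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def lora_forward(W, A, B, x, alpha):
    """y = W x + (alpha / r) * B (A x); W stays frozen, only A and B are trained."""
    r = len(A)  # A is r x d_in, B is d_out x r
    base = matvec(W, x)
    update = matvec(B, matvec(A, x))
    return [b + (alpha / r) * u for b, u in zip(base, update)]


def trainable_fraction(d_out, d_in, r):
    """Share of a layer's parameters that LoRA trains: r * (d_in + d_out) / (d_in * d_out)."""
    return r * (d_in + d_out) / (d_in * d_out)


W = [[1.0, 0.0], [0.0, 1.0]]  # frozen base weight (identity, for readability)
A = [[1.0, 1.0]]              # rank-1 down-projection
B = [[0.5], [0.0]]            # rank-1 up-projection
y = lora_forward(W, A, B, [2.0, 3.0], alpha=2.0)

# Hypothetical 4096 x 4096 layer with rank 16.
fraction = trainable_fraction(4096, 4096, 16)
```

Under these assumed sizes, the adapter trains well under 1% of the layer's parameters, which is why the optimizer state and gradients shrink so sharply.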

  9.
    Video
    YouTube · 48w

    How I’d Learn AI in 2025 (If I Could Start Over)

    A comprehensive roadmap for learning AI in 2025, starting with foundational mathematics (linear algebra, calculus, probability) and Python programming. The guide covers essential data science libraries (NumPy, Pandas, Matplotlib), then progresses through machine learning concepts including supervised learning (regression and classification algorithms), unsupervised learning (clustering), and reinforcement learning. It explains neural networks, deep learning architectures, and modern generative AI, including transformers and large language models. The content includes practical learning resources such as Khan Academy, freeCodeCamp, and specific YouTube channels for hands-on implementation.

  10.
    Article
    Towards Data Science · 48w

    Part 2: Matrix-Matrix Multiplication

    Matrix-matrix multiplication is explained through visual X-diagrams that show how input values flow through transformations. The article derives the multiplication formula by demonstrating that multiplying matrices A*B creates a combined transformation equivalent to applying B first, then A. This visualization clearly explains why matrix multiplication is non-commutative (A*B ≠ B*A) and shows how special matrices like scale, shift, permutation, and triangular matrices behave when multiplied together, with their properties preserved in the resulting products.
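The composition and non-commutativity claims above are easy to check numerically. The following sketch uses plain Python lists and the column-vector convention (so A*B applies B first); the example matrices and the helper names `matmul` and `transform` are illustrative choices, not taken from the article.

```python
def matmul(A, B):
    """Standard matrix product: entry (i, j) is row i of A dotted with column j of B."""
    return [
        [sum(A[i][k] * B[k][j] for k in range(len(B))) for j in range(len(B[0]))]
        for i in range(len(A))
    ]


def transform(M, v):
    """Apply matrix M to column vector v."""
    return [sum(m * x for m, x in zip(row, v)) for row in M]


S = [[2, 0], [0, 3]]  # scale matrix
P = [[0, 1], [1, 0]]  # permutation matrix (swaps the two coordinates)
v = [1, 2]

combined = transform(matmul(S, P), v)        # apply the single matrix S*P
step_by_step = transform(S, transform(P, v))  # apply P first, then S

SP = matmul(S, P)
PS = matmul(P, S)

# The product of two upper-triangular matrices stays upper triangular.
U = matmul([[1, 2], [0, 3]], [[4, 5], [0, 6]])
```

Here `combined` and `step_by_step` agree, while `SP` and `PS` differ, which is the non-commutativity the article visualizes.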