Best of Data Science — September 2024

1
Article
Machine Learning Mastery·2y
10 Machine Learning Algorithms Explained Using Real-World Analogies
The post explains 10 common machine learning algorithms using real-world analogies to make them easier to understand. It covers algorithms like Linear Regression, Logistic Regression, Decision Tree, Random Forest, Support Vector Machine, Naive Bayes, K-Nearest Neighbors, K-means, Principal Component Analysis, and Gradient Boosting, providing everyday examples to illustrate how each algorithm functions.
258
6
2
Article
Javarevisited·2y
The 2024 Machine Learning Engineer RoadMap
The 2024 Machine Learning Engineer RoadMap offers a comprehensive guide to becoming a professional in the field. Starting with foundational languages like Python and R, it recommends essential courses and libraries such as NumPy, Pandas, and Matplotlib for data pre-processing and visualization. The roadmap details various types of machine learning techniques, including supervised, unsupervised, and reinforcement learning, with course recommendations for deeper understanding. It emphasizes the growing opportunities in the field and provides a curated set of resources for aspiring engineers.
238
3
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
15 DS/ML Cheat Sheets
This post collates 15 cheat sheets covering essential data science and machine learning concepts. It includes resources on translating between different data manipulation libraries, multi-GPU training strategies, testing ML models in production, neural network optimization, and more. Detailed links are provided for further reading.
232
1
4
Video
YouTube·2y
All Machine Learning algorithms explained in 17 min
Tim, a data scientist with over 10 years of experience, offers an intuitive overview of critical machine learning algorithms to help you choose the right one for your problem. The video covers supervised learning (like regression and classification), unsupervised learning (like clustering), and dives into specific algorithms such as linear regression, logistic regression, K-nearest neighbors (KNN), support vector machine (SVM), naive Bayes classifier, decision trees, random forests, boosting, neural networks, and dimensionality reduction. Each algorithm is explained with examples to build an intuitive understanding of its function and applications.
209
3
5
Article
Medium·2y
Teaching Your Model to Learn from Itself
In machine learning, labeling data can be expensive and time-consuming. Pseudo-labeling offers a solution by using confident predictions on unlabeled data to iteratively improve model accuracy. In a case study using the MNIST dataset, applying iterative, confidence-based pseudo-labeling increased model accuracy from 90% to 95%. Key strategies include maintaining rigorous thresholds, continuous performance evaluation, and incorporating human feedback for low-confidence data.
94
1
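The confidence-based pseudo-labeling loop summarized in "Teaching Your Model to Learn from Itself" can be sketched in a few lines. This is a minimal illustration, not the article's code: it assumes a toy 1-D nearest-centroid classifier with at least two classes, and the names `fit_centroids`, `predict_with_confidence`, and `pseudo_label`, the margin-based confidence score, and the 0.8 threshold are all hypothetical choices made for the example.

```python
from statistics import mean


def fit_centroids(xs, ys):
    """Fit a 1-D nearest-centroid model: one mean feature value per label."""
    return {label: mean(x for x, y in zip(xs, ys) if y == label) for label in set(ys)}


def predict_with_confidence(centroids, x):
    """Return (nearest label, confidence), where confidence is a margin in [0, 1]."""
    ranked = sorted(centroids, key=lambda label: abs(x - centroids[label]))
    d_best = abs(x - centroids[ranked[0]])
    d_next = abs(x - centroids[ranked[1]])
    total = d_best + d_next
    confidence = 0.0 if total == 0 else (d_next - d_best) / total
    return ranked[0], confidence


def pseudo_label(xs, ys, unlabeled, threshold=0.8, max_rounds=5):
    """Iteratively move confidently predicted points into the training set."""
    xs, ys, pool = list(xs), list(ys), list(unlabeled)
    for _ in range(max_rounds):
        centroids = fit_centroids(xs, ys)
        remaining, added = [], 0
        for x in pool:
            label, confidence = predict_with_confidence(centroids, x)
            if confidence >= threshold:  # keep only confident pseudo-labels
                xs.append(x)
                ys.append(label)
                added += 1
            else:
                remaining.append(x)
        pool = remaining
        if added == 0:  # nothing new was learned, so stop early
            break
    # Points still in the pool are candidates for human review.
    return fit_centroids(xs, ys), pool


centroids, needs_review = pseudo_label(
    xs=[0.0, 1.0, 9.0, 10.0],
    ys=[0, 0, 1, 1],
    unlabeled=[0.5, 2.0, 5.0, 8.0, 9.5],
)
```

Points that never clear the threshold stay in `needs_review`, which mirrors the summary's advice to route low-confidence data to human feedback rather than forcing a label.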
6
Video
freeCodeCamp·2y
End-to-End Machine Learning Project – AI, MLOps
This video walks through an end-to-end machine learning project for house price prediction. It covers core machine learning concepts, data analysis, feature engineering, and model implementation with robust testing, and it integrates MLOps tooling, using ZenML and MLflow for experiment tracking and deployment. The tutorial also stresses writing scalable, readable code with design patterns such as Factory and Strategy. The project sets itself apart through thorough data understanding and careful implementation, with the stated aim of strengthening a data science portfolio and career prospects.
76
7
Article
Machine Learning Mastery·2y
Automating Data Cleaning Processes with Pandas
Discover how to automate data cleaning processes using the Pandas library. Learn about typical data cleaning functions like filling missing values, removing duplicates, manipulating strings, and converting date formats. The post also introduces a custom class, DataCleaner, to encapsulate these steps into a reusable pipeline for an efficient and systematic approach to data cleaning.
64
8
Video
freeCodeCamp·2y
Kaggle Data Science Competition Course – Solve Three Challenges Step-by-Step
Enhance your data science skills by tackling Kaggle competitions, with Rohan Kumar's step-by-step course guiding you through solving three distinct Kaggle problems. This comprehensive tutorial covers project setup, data preprocessing, feature engineering, and model evaluation. It also emphasizes the importance of understanding each dataset thoroughly to create effective solutions.
63
9
Article
JetBrains·2y
How to Use FastAPI for Machine Learning
FastAPI is a user-friendly Python web framework ideal for quickly building backend services. It is widely adopted by companies like Microsoft and Netflix, offering benefits such as easy API creation, fast documentation, and simple testing. FastAPI allows data scientists with little to no backend experience to deploy prediction models, suggestion engines, and dynamic reporting systems efficiently. This guide walks through using FastAPI to create machine learning project APIs, including project setup, environment dependencies, request and response handling, and advanced functionality such as image classification and retraining models with background tasks.
58
1
10
Article
Changelog·2y
AI is more than GenAI (Practical AI #285)
AI encompasses more than just Generative AI (GenAI). Daniel Whitenack breaks down the history and development of data science, machine learning, AI, and GenAI to help listeners understand the AI ecosystem holistically, including models, embeddings, data, and prompts.
44
11
Article
Javarevisited·2y
Top 5 Courses to Learn Artificial Intelligence on Educative.io in 2024
AI is a crucial skill in 2024, revolutionizing industries and enhancing productivity. The post lists the top five AI courses on Educative.io, suitable for beginners and intermediates looking for hands-on, text-based learning. Courses include Grokking AI for Engineering & Product Managers, Machine Learning Handbook, Build Your Own Chatbot in Python, Introduction to Prompt Engineering with Llama 3, and Data Science Interview Handbook. Educative.io offers a range of comprehensive, practical courses and is currently offering a 50% discount on their Unlimited subscription.
42
12
Article
Hugging Face·2y
Fine-tuning LLMs to 1.58bit: extreme quantization made easy
As large language models (LLMs) grow, reducing their computational and energy costs via quantization becomes crucial. BitNet, a new transformer architecture from Microsoft Research, drastically cuts computational costs by representing each parameter with one of three values (-1, 0, 1), which works out to log2(3) ≈ 1.58 bits per parameter. The post details how existing models, like Llama 3, can be fine-tuned using BitNet, achieving efficient performance while maintaining accuracy. The article also covers the implementation, optimization, and benchmarking of custom inference kernels, making LLMs more scalable and practical.
40
1
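To make the ternary representation in the BitNet entry concrete, here is a rough sketch of absmean weight quantization, the scheme usually associated with BitNet b1.58: divide by the mean absolute weight, round, and clip to {-1, 0, 1}. It assumes a flat list of floats rather than a real weight tensor. The function names are hypothetical, and the post's fine-tuning procedure and inference kernels are not reproduced here.

```python
import math


def ternary_quantize(weights, eps=1e-8):
    """Absmean quantization: scale by mean |w|, round, then clip to {-1, 0, 1}."""
    scale = sum(abs(w) for w in weights) / len(weights)
    quantized = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantized, scale


def dequantize(quantized, scale):
    """Approximate the original weights from ternary codes and the shared scale."""
    return [q * scale for q in quantized]


# Three possible values carry log2(3) bits of information, hence "1.58-bit".
BITS_PER_TERNARY_WEIGHT = math.log2(3)

codes, scale = ternary_quantize([0.9, -0.05, 0.4, -1.2])
approx = dequantize(codes, scale)
```

Small weights collapse to 0 and large ones saturate at ±1, so only the sign pattern and a single shared scale survive. This is why fine-tuning is needed for a model to keep its accuracy under this constraint.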
13
Article
Community Picks·2y
tcsenpai/goldigger
goldigger is a Python-based tool for stock price prediction using machine learning models like LSTM, GRU, Random Forest, and XGBoost. It retrieves historical data from Yahoo Finance, incorporates technical indicators, and uses ensemble prediction to combine model results. It also features hyperparameter tuning, time series cross-validation, risk metrics calculation, and performance visualization. Designed for educational use, it provides customization through command-line arguments.
40
2
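The time series cross-validation mentioned in the goldigger entry differs from ordinary k-fold in that training data must always precede test data. The sketch below shows one common scheme, an expanding training window. It is a generic illustration and not necessarily how goldigger implements it; the function name and parameters are hypothetical.

```python
def expanding_window_splits(n_samples, n_splits, test_size):
    """Yield (train_indices, test_indices); training data always precedes test data."""
    first_test_start = n_samples - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough samples for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train = list(range(test_start))  # everything before the test fold
        test = list(range(test_start, test_start + test_size))
        yield train, test


splits = list(expanding_window_splits(n_samples=10, n_splits=3, test_size=2))
```

Each successive fold trains on a longer history and tests on the next unseen block, so no future price leaks into training.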
14
Article
This Dot·2y
Integrating AI Models Locally with Next.js ft. Jesus Padron
Jan Zirnstein discusses the importance of data governance, the need for leader education on AI technologies, and methods for addressing biases in AI training data. He highlights that organizations must ensure data accuracy, security, and ethical use while continuously validating AI models to foster trust and drive innovation.
37
15
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
Accelerate Pandas 20x using FireDucks
FireDucks is a highly optimized alternative to Pandas, boasting up to 20x performance improvements by leveraging multi-core CPU capabilities and lazy execution. With the same API as Pandas, FireDucks allows for seamless integration into existing Pandas pipelines by simply changing the import statement. The library is currently available for Linux x86_64, with versions for Windows and macOS in development.
34
16
Article
Medium·2y
Mastering Web Scraping: From Bypassing CAPTCHAs to Building Simple Scrapers
Web scraping is a powerful way to automate data collection, though it often means dealing with security measures such as CAPTCHAs and Cloudflare. This post covers various techniques to bypass CAPTCHAs using Python libraries, create voting bots with tools like Automatio.ai, and construct simple scrapers with Python and JavaScript. Ethical practices in web scraping are emphasized, along with real-life case studies demonstrating the practical benefits of using Python for data analysis and cost savings.
28
17
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
A Crash Course on Graph Neural Networks — Part 3
Part 3 of the crash course on Graph Neural Networks covers advanced methods for graph learning and several feature engineering techniques, along with implementation details. The course aims to provide a beginner-friendly introduction to GNNs, highlighting their importance in big-tech ML applications and outlining the benefits and challenges of using graph data. Key topics include GNN tasks, data challenges, frameworks, advanced architectures, and practical demos.
27
18
Article
Planet Python·2y
Using Pandas to Read JSON from URL
Learn how to use Pandas in Python to read JSON data directly from a URL into a DataFrame. This tutorial covers a basic example and explains the key parameters of the `pd.read_json()` method, enabling customization of the data reading process.
26
2
19
Article
freeCodeCamp·2y
From PhD drop-out to Google Data Scientist with Megan Risdal [Podcast #142]
Megan Risdal, a data scientist and Product Manager at Google's Kaggle, discusses the platform that hosts 300k open data sets and runs weekly data science competitions. She also compares the communication styles in academia versus tech, and contrasts her experiences with Stack Overflow and Kaggle. Additionally, Megan touches upon the importance of linguistics in AI research and her work on Google's Gemma open models project.
25
20
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
How to Inspect Decision Trees After Training with PCA
Decision trees split on one feature at a time, so their split conditions are perpendicular to the feature axes. When the true decision boundary is diagonal, the tree needs many such splits to approximate it, which can lead to overfitting. Running PCA before fitting a decision tree projects the data onto orthogonal axes that may align better with the boundary, potentially reducing the tree's depth and improving performance. However, PCA components are not interpretable, which can be a limitation in some cases. Proper feature engineering might be necessary for better model performance.
25
1
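The effect described in the decision tree entry can be reproduced on a toy dataset. This sketch assumes two classes lying on parallel diagonal lines, uses a closed-form 2-D PCA rotation, and treats a single threshold split (a depth-1 tree) as a stand-in for a full decision tree. All names and the dataset are hypothetical. PCA happens to align with the boundary here, which is not guaranteed in general.

```python
import math


def pca_rotate_2d(points):
    """Center 2-D points and rotate them onto their principal axes."""
    n = len(points)
    mx = sum(x for x, _ in points) / n
    my = sum(y for _, y in points) / n
    centered = [(x - mx, y - my) for x, y in points]
    sxx = sum(x * x for x, _ in centered) / n
    syy = sum(y * y for _, y in centered) / n
    sxy = sum(x * y for x, y in centered) / n
    theta = 0.5 * math.atan2(2 * sxy, sxx - syy)  # angle of the first principal axis
    c, s = math.cos(theta), math.sin(theta)
    return [(x * c + y * s, -x * s + y * c) for x, y in centered]


def best_stump_accuracy(points, labels):
    """Best accuracy of any single axis-aligned threshold split."""
    n, best = len(points), 0.0
    for axis in (0, 1):
        values = sorted({p[axis] for p in points})
        for lo, hi in zip(values, values[1:]):
            threshold = (lo + hi) / 2
            hits = sum((p[axis] > threshold) == bool(y) for p, y in zip(points, labels))
            best = max(best, hits / n, (n - hits) / n)  # either side may be class 1
    return best


# Class 0 lies on y = x + 1 and class 1 on y = x - 1: a diagonal boundary.
points = [(t, t + 1) for t in range(-10, 11)] + [(t, t - 1) for t in range(-10, 11)]
labels = [0] * 21 + [1] * 21

raw_accuracy = best_stump_accuracy(points, labels)
pca_accuracy = best_stump_accuracy(pca_rotate_2d(points), labels)
```

On the raw features no single split does much better than chance. After rotation, the second principal component separates the classes with one threshold.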
21
Video
Tech With Tim·2y
These Coding Niches Will Make You $$$$ in 2024
Explore five high-paying coding niches for 2024: artificial intelligence and machine learning, data science, blockchain development, cybersecurity, and DevOps. From creating algorithms for AI to protecting sensitive data in cybersecurity, these fields offer lucrative opportunities for tech professionals.
24
22
Article
Hacker News·2y
satmihir/fair: A Go library for serving resources fairly
FAIR is a Go library designed to distribute limited resources evenly across multiple clients in resource-constrained environments. It uses a modified version of Stochastic Fair BLUE, an algorithm often used for network congestion control, together with a multi-level Bloom filter for efficiency. It provides easy integration, automatic tuning, and scalability to large numbers of clients. The library ensures fairness in resource allocation and prevents over-allocation and starvation.
23
23
Video
IBM Technology·2y
RAG vs. Fine Tuning
Retrieval augmented generation (RAG) and fine-tuning are two techniques for enhancing large language models. RAG retrieves external, up-to-date information to augment responses, making it effective for dynamic data sources and mitigating model hallucinations. Fine-tuning adapts a model to a specific domain or style by incorporating labeled and targeted data into the model's weights, providing more specialized and consistent outputs. Both techniques have their strengths and weaknesses, and the choice between them or a combination depends on specific use cases, data requirements, and desired model behavior.
23
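The retrieval half of RAG, as described in the "RAG vs. Fine Tuning" entry, can be illustrated without any model at all. The sketch below assumes a toy bag-of-words similarity in place of learned embeddings and a vector database, and it stops at building the augmented prompt rather than calling an LLM. The function names and sample documents are hypothetical.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercase the text and count its alphanumeric tokens."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine_similarity(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, documents, k=1):
    """Return the k documents most similar to the query."""
    q = bag_of_words(query)
    ranked = sorted(documents, key=lambda d: cosine_similarity(q, bag_of_words(d)), reverse=True)
    return ranked[:k]


def build_prompt(query, documents, k=1):
    """Prepend retrieved context to the question before it goes to a model."""
    context = "\n".join(retrieve(query, documents, k))
    return f"Context:\n{context}\n\nQuestion: {query}\nAnswer:"


docs = [
    "The refund window is 30 days from purchase.",
    "Support is available on weekdays from 9 to 5.",
    "Shipping takes 3 to 5 business days.",
]
prompt = build_prompt("How long is the refund window?", docs)
```

Because the knowledge lives in `docs` rather than in model weights, updating an answer only requires editing a document. Fine-tuning, by contrast, changes the weights themselves.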
24
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
15 Ways to Optimize Neural Network Training (With Implementation)
Discover 15 techniques to optimize neural network training, complete with code examples. Understanding and applying these techniques is crucial for ML engineers to efficiently manage model training processes, save operational costs, and add genuine value. The post emphasizes the importance of identifying bottlenecks, selecting appropriate techniques, and considering trade-offs and hardware limitations.
22
25
Article
Towards AI·2y
Why OpenAI’s o1 Model Is A Scam
OpenAI's o1 model claims to advance AI by making it think before responding, using the Chain of Thought (CoT) technique. However, the author argues that the model is mostly a repackaged marketing ploy, as CoT has been around for years. The post includes a Python implementation of CoT and discusses the potential benefits of OpenAI's reinforcement learning for better intermediate step performance. Readers are advised to critically evaluate such new features before committing financially.
21
3
