Best of Data Science — December 2024

1
Article
Machine Learning Mastery·1y
7 Machine Learning Projects For Beginners
Explore seven beginner-friendly machine learning projects to gain real-world experience and enhance your career prospects. Projects include Titanic Survival Prediction, Stock Price Prediction, Email Spam Classifier, Handwritten Digit Recognition, Movie Recommendation System, Customer Churn Prediction, and Face Detection. These projects will teach you important ML skills such as data preparation, classification, regression, computer vision, and natural language processing.
243
4
2
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
RAG vs Agentic RAG
Agentic RAG systems introduce dynamic, adaptable behaviors into the traditional RAG workflow. Unlike traditional RAG, which retrieves and generates once, agentic RAG iteratively refines queries and context, adapting to the problem's complexity. This makes it more effective for complex queries and problem-solving. The open-source tool Opik by CometML supports the evaluation, testing, and monitoring of LLM applications from development to production, offering features like logging traces and detecting hallucinations.
86
3
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
A crash course on RAG systems—Part 5
Part 5 of the RAG crash course focuses on the implementation of key components for multimodal RAG systems, such as CLIP embeddings, multimodal prompting, and tool calling. The series aims to educate readers on building reliable RAG systems that can reduce costs and handle complex data types, ultimately aiding businesses in achieving greater impact.
83
4
Article
Towards AI·1y
10 No-Nonsense Machine Learning Tips for Beginners (Using Real-World Datasets)
Get practical with machine learning by starting with simple models like Linear Regression and Decision Trees using real-world datasets from the UCI Machine Learning Repository. Focus on hands-on experimentation to build a strong foundation before diving into more complex models like neural networks.
71
5
Video
YouTube·1y
Data Science Full Course - Complete Data Science Course | Data Science Full Course For Beginners IBM
Data science is a rapidly growing field with significant career opportunities due to the massive amounts of data produced and advancements in computing power and artificial intelligence. The course from IBM introduces key concepts and skills necessary for starting a career in data science, including big data, artificial intelligence, and cloud computing. It provides instructional videos, readings, practice assessments, and insights from data science professionals, concluding with a case study and a final peer-reviewed project.
67
6
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
A crash course on RAG systems—Part 6
Part 6 of the crash course on RAG systems explores how to build a more extensive and capable multimodal RAG system using CLIP embeddings, multimodal prompting, and tool calling. The post includes a unique dataset combining social media posts with images to provide a practical learning experience. The series covers everything from foundational components and evaluation to optimization and handling complex documents, aiming to help users implement reliable RAG systems and solve key NLP challenges with LLMs.
57
7
Article
Medium·1y
Beyond If/Else: Advanced Python Control Flow
Explore advanced methods for control flow in Python without using traditional if/else statements. Learn how to build a dynamic calculator using the operator module, eval(), and the match statement introduced in Python 3.10, along with other techniques such as dictionary dispatch and lambda functions.
50
2
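As a rough illustration of the dictionary-dispatch idea mentioned in the entry above, here is a minimal sketch built on the standard-library operator module. The names OPERATIONS and calculate are hypothetical and chosen for this example; they are not taken from the linked article.

```python
import operator

# Map each operator symbol to the function that implements it.
# Adding a new operation means adding one entry, not another branch.
OPERATIONS = {
    "+": operator.add,
    "-": operator.sub,
    "*": operator.mul,
    "/": operator.truediv,
}


def calculate(left, symbol, right):
    """Look up the operation by symbol instead of chaining if/elif/else."""
    try:
        func = OPERATIONS[symbol]
    except KeyError:
        # Unknown symbols fail loudly rather than falling through silently.
        raise ValueError(f"unsupported operator: {symbol!r}") from None
    return func(left, right)


print(calculate(6, "*", 7))
print(calculate(1, "/", 4))
```

The lookup table replaces the branching logic, so the set of supported operations lives in data rather than in control flow.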
8
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
[Hands-on] Tool calling in LLMs
Tool calling allows language models to perform specific tasks by invoking external tools or APIs. The process involves recognizing when an external tool is needed, invoking the tool, and integrating its output into the model's response. This enhances the flexibility and capability of LLMs. A demo is provided to build a stock price retrieval assistant using the yfinance library.
41
1
9
Article
Community Picks·1y
dask/dask: Parallel computing with task scheduling
Dask is a flexible parallel computing library designed for analytics. It enables efficient task scheduling and is licensed under the New BSD License.
36
1
10
Article
Collections·1y
Building a Recommendation System in Python Using Surprise
Learn how to build a recommendation system in Python using the Surprise module, with an example based on the MovieLens dataset. The guide covers loading data, splitting datasets, training the model using the SVD algorithm, and making predictions to evaluate accuracy. Additional tools like TensorFlow and PyTorch for advanced systems are also mentioned.
25
11
Article
Towards AI·1y
LLM Fine-Tuning Guide: Do You Need It and How to Do It
Fine-tuning a Large Language Model (LLM) is often unnecessary for many commercial applications, but it can be useful when a task requires a specific chat format or domain knowledge, or when a smaller specialized model is more cost-effective. Fine-tuning involves data preparation, including deduplication and removal of personal information, and can be done using techniques like LoRA (Low-Rank Adaptation) or QLoRA. Reinforcement learning from human feedback (RLHF) or direct preference optimization (DPO) can align models with human preferences. For fine-tuning and hosting, cloud platforms like AWS SageMaker and collaborative tools like Hugging Face are recommended.
25
12
Article
Hacker News·1y
The RAM myth
Commonly used straightforward algorithms for data sharding may perform suboptimally due to frequent cache misses. By sorting elements by their group before processing, one can significantly reduce cache misses, improving performance even with large in-memory data. Radix sort and other cache-aware algorithms offer further optimizations. These techniques are beneficial for handling big data where efficient memory usage is critical.
24
2
13
Article
Machine Learning Mastery·1y
Machine Learning vs. Traditional Analytics: When to Use Which?
Understanding the differences between data analytics, data science, big data, and business intelligence is crucial. Data analytics focuses on predicting future patterns to support business decisions, while machine learning, a subfield of AI, builds models to perform tasks like classification and regression. Machine learning is best used for making predictions from complex datasets, whereas traditional analytics methods are suited for understanding historical data and identifying trends in smaller datasets.
24
14
Article
Towards Data Science·1y
Becoming a Data Scientist: What I Would Do If I Had to Start Over
The journey to becoming a data scientist involves starting with a solid foundation in mathematics, learning programming (preferably Python), mastering SQL for data manipulation, and understanding machine learning algorithms. Equally important are practical experience and business acumen, which allow technical skills to translate into business value. Begin small, apply your knowledge to real-world problems, and progressively build on your projects to enhance your skills and showcase your capabilities.
23
15
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
Train Classical ML Models on Large Datasets
Cohere announces Command R7B, a lightweight, fast, and enterprise-ready multilingual 7B-parameter model suitable for real-time chatbots and AI agents. Additionally, methods to train classical ML models on large datasets, such as using big-data frameworks like Spark MLlib or the Random Patches approach, are discussed. Random Patches, which involves sampling data patches for tree-based models, can outperform traditional random forests in some cases.
22
16
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
A crash course on RAG systems—Part 7
Part 7 of the RAG crash course focuses on building graph RAG systems using a graph database to store entities and relationships. It highlights the advantages of structured data for LLMs and includes implementation details suitable for beginners. The series covers foundational aspects, evaluation, optimization, and multimodal techniques for RAG systems. Understanding RAG systems can help reduce costs, drive revenue, and scale ML models effectively.
21
17
Article
JetBrains·1y
Introduction to Sentiment Analysis in Python
Sentiment analysis in Python helps determine the emotional tone of text using natural language processing (NLP). The post reviews several sentiment analysis techniques and Python packages, including VADER, TextBlob, NLTK, spaCy, and Hugging Face Transformers. It highlights the advantages and limitations of these methods and showcases how PyCharm can simplify working with these tools and visualizing results for better data interpretation.
21
18
Article
Crio.Do·1y
Understanding Data Modeling in the age of AI: For Beginners
Data modeling involves organizing information into a structured format and is essential for building systems that use data for tasks like AI and analysis. It includes three main steps: conceptual modeling, logical modeling, and physical modeling. Benefits include simplifying complexity, ensuring accuracy, saving time and money, supporting growth, and improving communication and system speed. Learning data modeling now can boost your career in AI and data science.
21
19
Article
Lil’Log·1y
Reward Hacking in Reinforcement Learning
Reward hacking in reinforcement learning (RL) occurs when agents exploit flaws in reward functions to obtain high rewards without genuinely completing the intended task. This issue has become a practical challenge with the rise of language models and RLHF (Reinforcement Learning from Human Feedback). Reward functions are hard to specify accurately, and poorly designed ones can lead to unintended agent behaviors. Related concepts include reward tampering and specification gaming. Mitigation strategies include better reward function design, adversarial training, and anomaly detection.
21
20
Article
DigitalOcean Community·1y
Master Python File Operations: Read, Write & Delete Files
This tutorial covers essential file operations in Python, including how to open, read, write, and delete files. It explains the different file modes, how to handle file-related errors, and best practices such as using the `with` statement. Advanced topics include copying and moving files, working with directories, and using the `shutil` and `os` libraries for various file manipulations.
19
2
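The operations listed in the entry above can be sketched with the standard library alone. This is a minimal example under assumed file names (notes.txt, notes_backup.txt, archive); the helper demo_file_operations is hypothetical and not taken from the tutorial.

```python
import os
import shutil
import tempfile


def demo_file_operations(base_dir):
    """Write, append, read, copy, move, and delete a file under base_dir."""
    original = os.path.join(base_dir, "notes.txt")

    # "w" creates or truncates the file; "a" appends; "r" reads.
    # The with statement closes the file even if an error occurs.
    with open(original, "w", encoding="utf-8") as f:
        f.write("first line\n")
    with open(original, "a", encoding="utf-8") as f:
        f.write("second line\n")
    with open(original, "r", encoding="utf-8") as f:
        lines = f.read().splitlines()

    # shutil handles copying and moving; os handles directories and deletion.
    backup = shutil.copy(original, os.path.join(base_dir, "notes_backup.txt"))
    archive_dir = os.path.join(base_dir, "archive")
    os.makedirs(archive_dir, exist_ok=True)
    moved = shutil.move(backup, os.path.join(archive_dir, "notes_backup.txt"))

    os.remove(original)
    try:
        # Deleting a file that no longer exists raises FileNotFoundError.
        os.remove(original)
    except FileNotFoundError:
        already_gone = True
    else:
        already_gone = False

    return lines, moved, already_gone


# Run inside a temporary directory so nothing is left behind.
with tempfile.TemporaryDirectory() as scratch:
    lines, moved_path, already_gone = demo_file_operations(scratch)
    print(lines)
    print(os.path.basename(moved_path), already_gone)
```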
21
Article
GoPenAI·1y
How Does Support Vector Regression (SVR) Differ from Linear Regression?
Support Vector Regression (SVR) extends Support Vector Machines for regression tasks, introducing an epsilon-insensitive tube to manage errors within a margin. Distinct from linear regression, which minimizes overall error, SVR focuses only on significant deviations. Key concepts include the epsilon-insensitive tube, slack variables, and support vectors, providing a robust alternative for noisy datasets.
16
1
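To make the contrast in the entry above concrete, the sketch below compares the squared-error loss used by ordinary least squares with the epsilon-insensitive loss used by SVR on single predictions. The function names and the default epsilon of 0.5 are assumptions made for illustration.

```python
def squared_error(y_true, y_pred):
    """Least-squares loss: every deviation is penalized, however small."""
    return (y_true - y_pred) ** 2


def epsilon_insensitive(y_true, y_pred, epsilon=0.5):
    """SVR-style loss: zero inside the epsilon tube, linear outside it."""
    return max(0.0, abs(y_true - y_pred) - epsilon)


# A small miss (inside the tube) and a large miss (outside the tube).
for prediction in (3.25, 5.0):
    print(
        prediction,
        squared_error(3.0, prediction),
        epsilon_insensitive(3.0, prediction),
    )
```

Points whose error stays inside the tube contribute nothing to the SVR loss, which is why only the points on or outside the tube boundary (the support vectors) shape the fitted function.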
22
Article
Towards AI·1y
Real-Time Object Detection using YOLOv7 on Google Colab
Learn how to perform real-time object detection using YOLOv7 on Google Colab in this detailed tutorial. Understand the structure of the training data and the bounding box representation used in YOLO models, and follow steps to apply the model to your videos.
15
23
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
The Intuition Behind Using ‘Variance’ in PCA
Principal Component Analysis (PCA) leverages variance preservation to reduce dimensions while retaining essential information. By retaining more variance during dimensionality reduction, less information is lost. PCA transforms data to create uncorrelated features and drops features based on their variance, which can be influenced by outliers. Further mathematical details about PCA, including vector projections, Lagrange Multipliers, and optimization steps, are available. Discussions on other machine learning topics like graph neural networks, NLP systems, quantization, and federated learning are also provided.
14
24
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
Breathing KMeans vs KMeans
Breathing KMeans improves the standard KMeans clustering algorithm by adding and removing centroids based on error and utility metrics. This approach ensures better clustering results with reduced runtime overhead compared to multiple initializations of KMeans. The algorithm is implemented in the 'bkmeans' library with a sklearn-like API.
14
25
Article
Medium·1y
Lasso and Elastic Net Regressions, Explained: A Visual Guide with Code Examples
Lasso and Elastic Net regressions are advanced variations of linear regression. Lasso automatically selects significant features by applying a penalty that can reduce some coefficients to zero, making it useful for feature selection. Elastic Net combines the traits of both Lasso and Ridge regressions, utilizing penalties to manage feature selection and correlation. Both use the coordinate descent algorithm for optimization, which updates coefficients iteratively. Practical code examples using Python's scikit-learn library demonstrate the implementation and training of these models.
14
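The coordinate descent update mentioned in the entry above can be sketched in pure Python for the Lasso case. This is a simplified illustration, not the article's code: it assumes no intercept, a fixed number of sweeps, and the objective (1/(2n))·||y − Xw||² + alpha·||w||₁, which matches the scaling scikit-learn documents for Lasso. The function names are hypothetical.

```python
def soft_threshold(rho, alpha):
    """Shrink rho toward zero by alpha; values within [-alpha, alpha] become 0."""
    if rho > alpha:
        return rho - alpha
    if rho < -alpha:
        return rho + alpha
    return 0.0


def lasso_coordinate_descent(X, y, alpha, n_sweeps=100):
    """Minimize (1/(2n))*||y - Xw||^2 + alpha*||w||_1, one coefficient at a time."""
    n, p = len(X), len(X[0])
    w = [0.0] * p
    for _ in range(n_sweeps):
        for j in range(p):
            rho = 0.0  # correlation of feature j with the partial residual
            z = 0.0    # mean squared value of feature j
            for i in range(n):
                # Prediction using every coefficient except w[j].
                partial = sum(X[i][k] * w[k] for k in range(p) if k != j)
                rho += X[i][j] * (y[i] - partial)
                z += X[i][j] ** 2
            rho /= n
            z /= n
            # The L1 penalty is what allows the update to land exactly on zero.
            w[j] = soft_threshold(rho, alpha) / z if z > 0 else 0.0
    return w


# Toy data: y depends only on the first feature; the second is irrelevant.
X = [[1.0, 1.0], [2.0, -1.0], [3.0, 1.0], [4.0, -1.0]]
y = [2.0, 4.0, 6.0, 8.0]
print(lasso_coordinate_descent(X, y, alpha=0.1))
```

On this toy data the irrelevant feature's coefficient is set exactly to zero, while the relevant coefficient is shrunk slightly below its unpenalized value of 2, which is the feature-selection behavior the summary describes.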
