Best of Data Science — November 2024

1
Video
ByteByteGo·2y
Big-O Notation in 3 Minutes
Understanding Big O notation is crucial for measuring algorithm efficiency and optimizing code performance. Various time complexities, from constant to factorial, have unique characteristics and practical applications. Real-world performance can be influenced by factors like caching, memory usage, and hardware specifics, making it essential to profile your code and understand your hardware for optimal results.
147
5
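As a rough illustration of the gap between linear and logarithmic growth described in the Big-O item above, the sketch below counts comparisons instead of measuring wall-clock time. The function names and the 1,024-element list are assumptions made for this example and do not come from the video.

```python
def linear_search_steps(items, target):
    """Count comparisons made by a left-to-right scan: O(n)."""
    steps = 0
    for value in items:
        steps += 1
        if value == target:
            break
    return steps


def binary_search_steps(items, target):
    """Count comparisons made by binary search on a sorted list: O(log n)."""
    steps = 0
    lo, hi = 0, len(items) - 1
    while lo <= hi:
        steps += 1
        mid = (lo + hi) // 2
        if items[mid] == target:
            break
        if items[mid] < target:
            lo = mid + 1
        else:
            hi = mid - 1
    return steps


# Hypothetical input: a sorted list, searching for its last element.
data = list(range(1024))
linear_steps = linear_search_steps(data, 1023)
binary_steps = binary_search_steps(data, 1023)
print(linear_steps, binary_steps)
```

Step counts ignore caching and hardware effects, which is why the summary also recommends profiling real code.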
2
Video
freeCodeCamp·2y
AI Foundations Course – Python, Machine Learning, Deep Learning, Data Science
This comprehensive 11-hour AI Foundations Course covers essential topics in machine learning, data science, and AI. It offers both theoretical knowledge and practical implementation with Python. The course includes real-world case studies, career guidance, startup advice, and interview preparation. Ideal for aspiring machine learning or AI engineers, it teaches fundamental to advanced algorithms, hands-on data analytics, and provides insights from industry professionals.
122
1
3
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
A Crash Course on Building RAG Systems – Part 4
Part 4 of the crash course on building RAG systems focuses on implementing RAG on multimodal data, specifically complex documents with tables, texts, and images. This series covers foundational components, evaluation methods, optimization techniques, and handling large data sets, making it highly beginner-friendly. Understanding how to build reliable RAG systems can reduce costs and enhance scalability for enterprises, bypassing the need for fine-tuning large language models (LLMs).
118
4
Article
Pythonner·2y
3. Roadmap To Python Programming!
Learning Python is essential for entering fields like Software Development, Artificial Intelligence, and Data Science. It is important to understand core Python topics including variables, data types, loops, conditional statements, and functions, along with tools like IDEs and libraries. The roadmap offers comprehensive topics to master before starting your coding journey.
99
8
5
Video
YouTube·1y
Learn Machine Learning Like a GENIUS and Not Waste Time
Learn the smart way to master machine learning without wasting time. Focus on the essential skills: Python programming, data analysis with Pandas, core math concepts like statistics and linear algebra, and simple machine learning algorithms. Practice through real projects, not just tutorials, and learn to adapt quickly as technology evolves. Understand the fundamentals deeply before moving to more advanced topics. Collaborate, share your projects, and avoid common pitfalls to maximize your learning efficiency.
81
6
Article
Collections·2y
Why There’s No Better Time to Learn LLM Development
The rapid evolution of Large Language Models (LLMs) offers significant efficiency gains, making it an ideal time to learn LLM development. The comprehensive guide *Building LLMs for Production* helps bridge the skill gap for aspiring developers. Key techniques covered include prompting, fine-tuning, and data preparation. The updated edition offers new chapters on data intricacies, and indexes and retrievers, ensuring developers have the latest insights and practices. The guide is available at a discounted rate on the Towards AI Academy platform.
72
2
7
Video
Community Picks·2y
15 Machine Learning Lessons I Wish I Knew Earlier
Switching to a career in machine learning or data science can be challenging. Key takeaways include understanding the importance of mastering fundamentals over trendy tools, handling imposter syndrome, emphasizing data pre-processing, understanding the business problem fully, and continuously learning and adapting to new advancements. Collaboration and communication skills are essential, as well as practical experience with real-world data projects. Networking plays a crucial role in career growth.
67
8
Article
Metabase·2y
How to visualize time-series data: best practices
Learn the best practices for visualizing time-series data, including selecting the right chart types and structuring your data for clear and impactful visualizations. Charts such as line, bar, area, trend, and waterfall are discussed, along with techniques like using offsets for comparisons. A time-series cheat sheet dashboard is also available for easy reference.
66
1
9
Article
InfoWorld·2y
The machine learning certifications tech companies want
Machine learning certifications are becoming increasingly valuable as organizations leverage AI for various applications such as product enhancement, speech recognition, and fraud detection. Experts suggest that these certifications provide structured learning, proof of skills, and can lead to better job prospects. Popular certifications include AWS Certified Machine Learning – Specialty, Databricks Certified Machine Learning Professional, Google Cloud Professional Machine Learning Engineer, Microsoft Certified: Azure Data Scientist Associate, and Stanford University's Machine Learning Specialization. While certifications can enhance employability, hands-on experience with machine learning tools is also crucial.
66
1
10
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
Simplify Python Imports with Explicit Packaging
Learn how to simplify your Python project imports by explicitly packaging your project with an __init__.py file. This method not only helps to avoid redundant imports but also allows you to specify which classes and functions can be imported from the package. The article explains the difference between modules, packages, and libraries, and provides a step-by-step guide on how to use __init__.py to streamline your code.
56
1
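A minimal sketch of the explicit-packaging idea from the item above. It builds a throwaway package in a temporary directory so the example is self-contained. The package name `mypkg`, the module `training.py`, and the class `Trainer` are hypothetical and are not taken from the article.

```python
import importlib
import sys
import tempfile
from pathlib import Path

# Build a tiny hypothetical package on disk: mypkg/__init__.py + mypkg/training.py
root = Path(tempfile.mkdtemp())
pkg = root / "mypkg"
pkg.mkdir()

(pkg / "training.py").write_text(
    "class Trainer:\n"
    "    pass\n"
    "\n"
    "def _helper():\n"
    "    pass\n"
)

# __init__.py re-exports the public name and lists it in __all__.
(pkg / "__init__.py").write_text(
    "from .training import Trainer\n"
    "\n"
    "__all__ = ['Trainer']\n"
)

# Make the temporary directory importable, then import the package.
sys.path.insert(0, str(root))
mypkg = importlib.import_module("mypkg")

# The short form works because __init__.py re-exports Trainer;
# without it, callers would write: from mypkg.training import Trainer
from mypkg import Trainer
```

Note that `__all__` only governs what `from mypkg import *` exposes; names such as `_helper` remain reachable through `mypkg.training` if a caller asks for them explicitly.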
11
Article
Real Python·2y
Introduction to Web Scraping With Python – Real Python
Web scraping is the process of collecting and parsing raw data from the web using powerful Python tools. This video course offers 12 lessons covering methods such as string methods, regular expressions, and HTML parsing. It includes downloadable resources, subtitles, transcripts, an interactive quiz, and a certificate of completion to help you effectively scrape data from websites.
47
12
Article
Machine Learning Mastery·2y
Building a Robust Machine Learning Pipeline: Best Practices and Common Pitfalls
A machine learning pipeline is essential for operating models and delivering value. For robustness, it's crucial to structure the pipeline well and maintain reliability at each stage, even with changing environments. Some key pitfalls to avoid include ignoring data quality, overcomplicating models, inadequate monitoring, and not versioning data and models. Best practices involve using appropriate model evaluation metrics, employing MLOps for deployment and monitoring, and preparing comprehensive documentation.
36
13
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
16 Popular Open-source Contributions by Big Tech
Big tech companies like Microsoft, Google, Meta, Yandex, and NVIDIA have significantly contributed to the machine learning ecosystem through various open-source projects. These contributions include Microsoft's DeepSpeed and ONNX, Google's TensorFlow and JAX, Meta's PyTorch and LLaMA, Yandex's CatBoost and ClickHouse, and NVIDIA's RAPIDS and TensorRT. Understanding these tools can help you tackle real-world problems efficiently.
35
1
14
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
A Crash Course on Building RAG Systems – Part 2
Gain expertise in implementing RAG systems with this beginner-friendly guide. Part 2 builds on the foundations of Part 1, focusing on practical implementation. Learn how RAG systems address challenges in NLP and help bypass the costs of fine-tuning LLMs, offering enterprises significant cost savings. This crash course covers essential techniques and practical guidance for building reliable RAG applications.
34
15
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
The No-code Data Science Tool Stack
Explore 8 powerful no-code tools for data science and machine learning tasks. Tools like Mito, Gigasheet, PivotTableJS, Drawdata, PyGWalker, Visual Python, TensorFlow Playground, and ydata-profiling offer features such as spreadsheet interfaces, large-scale data analysis, pivot tables, interactive visualizations, GUI-based code generation, and automated EDA reports. Additionally, SwarmZero addresses the limitations of OpenAI's Swarm by offering a highly customizable, production-ready multi-agent app framework.
27
1
16
Article
Towards AI·2y
This Pandas Trick Will Blow Your Mind As a Data Scientist!
Learn how to automate data analysis with Pandas through an 8-step process. The guide covers setting up your environment, uploading CSV files, and generating comprehensive reports with just one click. Essential libraries include Pandas, NumPy, ipywidgets, Matplotlib, and Seaborn.
25
17
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
Traditional RAG vs. HyDE
Traditional RAG systems often retrieve irrelevant contexts because questions are frequently not semantically similar to their answers. HyDE mitigates this by generating a hypothetical answer to the query and embedding it with a Contriever model to fetch more relevant contexts. While this improves retrieval performance, it comes at the cost of higher latency and more LLM usage.
24
18
Article
Medium·2y
The Python Operator You Didn’t Know You Needed!
The walrus operator (`:=`), introduced in Python 3.8, allows developers to assign values to variables directly within expressions. This helps make the code more concise and readable, especially in conditions and loops. While it simplifies code, it should be used judiciously to avoid readability issues. The article provides examples of how to use the walrus operator effectively in different scenarios.
22
1
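A short sketch of the two uses the walrus-operator item above mentions, one in a condition and one in a loop. The helper names `first_number` and `read_chunks` are made up for this illustration and are not from the article.

```python
import itertools
import re


def first_number(text):
    """Return the first integer found in text, or None."""
    # Assign the match object and test it in the same expression.
    if (match := re.search(r"\d+", text)) is not None:
        return int(match.group())
    return None


def read_chunks(values, size):
    """Split an iterable into lists of at most `size` items."""
    iterator = iter(values)
    chunks = []
    # The loop condition both fetches the next chunk and checks that it is non-empty.
    while chunk := list(itertools.islice(iterator, size)):
        chunks.append(chunk)
    return chunks


print(first_number("order 42 shipped"))
print(read_chunks(range(5), 2))
```

Without `:=`, each function would need a separate assignment line before the test, and the loop would need either a duplicated fetch or a `while True` with a `break`.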
19
Article
ML & AI·2y
Building an AI Chat with Google Docs Knowledge Base Using Colab + Pinecone
The post walks through building a chat application using Pinecone's Assistants and retrieval-augmented generation (RAG) with data from Google Drive. Pinecone's Python SDK is used in Google Colab to upload documents from a Drive folder, which handles indexing and embedding so that relevant documents can be retrieved efficiently during chats.
22
1
20
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
Categorization of Clustering Algorithms
The post provides an overview of six different types of clustering algorithms beyond the commonly known KMeans. These include centroid-based, connectivity-based, density-based, graph-based, distribution-based, and compression-based algorithms. The visual summary highlights key features and examples like DBSCAN and Gaussian Mixture Models. Additionally, the post promotes an open-source framework called Dynamiq for developing AI applications with AI Agents and LLMs, designed to streamline complex workflows.
22
21
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
Random Splitting Can be Fatal for ML Models
Randomly splitting data into training and validation sets can cause data leakage when related data points land on both sides of the split, which makes validation scores look better than they should and hides overfitting. Techniques like GroupShuffleSplit in sklearn prevent this by keeping all related data points together, so each group ends up entirely in either the training set or the validation set, never both. The method is illustrated using datasets with image captions and medical imaging, where specific features or identifiers are used as grouping criteria.
21
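To make the group-aware splitting idea in the item above concrete, here is a hand-rolled, standard-library sketch. It is a simplified stand-in written for illustration, not sklearn's `GroupShuffleSplit` implementation, and the `group_split` helper and the image IDs are hypothetical.

```python
import random


def group_split(groups, val_fraction=0.25, seed=0):
    """Return (train_idx, val_idx) so that no group appears in both sets."""
    unique_groups = sorted(set(groups))
    rng = random.Random(seed)
    rng.shuffle(unique_groups)

    # Choose whole groups for validation, never individual rows.
    n_val = max(1, round(len(unique_groups) * val_fraction))
    val_groups = set(unique_groups[:n_val])

    train_idx = [i for i, g in enumerate(groups) if g not in val_groups]
    val_idx = [i for i, g in enumerate(groups) if g in val_groups]
    return train_idx, val_idx


# Hypothetical data: two captions per image, grouped by image ID.
image_ids = ["img1", "img1", "img2", "img2", "img3", "img3", "img4", "img4"]
train_idx, val_idx = group_split(image_ids, val_fraction=0.25, seed=0)

train_groups = {image_ids[i] for i in train_idx}
val_groups = {image_ids[i] for i in val_idx}
print(train_groups, val_groups)
```

A plain random split over the eight rows could put one caption of an image in training and the other in validation; splitting by image ID rules that out.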
22
Video
Community Picks·2y
Markov Chains Clearly Explained! Part - 1
The video introduces Markov chains, a concept used in fields such as statistics, biology, economics, physics, and machine learning. It explains how a Markov chain predicts future states from the current state alone, using a restaurant example to illustrate transitions between states. It highlights the Markov property and the stationary distribution, and shows how to find the stationary distribution using linear algebra. The video concludes by validating the theoretical results with a simulation and invites viewers to engage for more content on advanced Markov chain topics.
21
1
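The stationary-distribution idea in the Markov chain item above can be sketched in a few lines. The three states and the transition probabilities below are hypothetical values chosen for this example, not necessarily those used in the video, and repeated multiplication (power iteration) is used here as one simple way to approximate the distribution.

```python
import random

STATES = ["burger", "pizza", "hotdog"]

# Hypothetical transition matrix: row i holds P(next state | current state i).
P = [
    [0.2, 0.6, 0.2],
    [0.3, 0.0, 0.7],
    [0.5, 0.0, 0.5],
]


def stationary(P, iterations=1000):
    """Approximate pi with pi = pi P by repeatedly applying P to a start vector."""
    n = len(P)
    pi = [1.0 / n] * n
    for _ in range(iterations):
        pi = [sum(pi[i] * P[i][j] for i in range(n)) for j in range(n)]
    return pi


def simulate(P, steps=100_000, seed=0):
    """Walk the chain and return the fraction of time spent in each state."""
    rng = random.Random(seed)
    state = 0
    counts = [0] * len(P)
    for _ in range(steps):
        state = rng.choices(range(len(P)), weights=P[state])[0]
        counts[state] += 1
    return [c / steps for c in counts]


pi = stationary(P)
freq = simulate(P)
for name, p, f in zip(STATES, pi, freq):
    print(f"{name}: theory={p:.3f} simulation={f:.3f}")
```

For a chain like this one, where every state can reach every other and some states can repeat, the long-run visit frequencies from the simulation should settle close to the computed distribution.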
23
Article
Hacker News·2y
circlemind-ai/fast-graphrag: RAG that intelligently adapts to your use case, data, and queries
Fast GraphRAG is a streamlined and adaptable framework for high-precision, agent-driven retrieval workflows. It offers cost-efficiency, dynamic data handling, and interpretable knowledge graphs that support real-time updates. You can install it from PyPI or from source and integrate it into your retrieval pipeline with full type support and asynchronous operations. The framework leverages PageRank-based graph exploration for accurate and dependable results. Contributions to this open-source project are encouraged, and a managed service option is available for ease of deployment.
20
24
Video
Telusko·2y
What is Numpy and Why?
NumPy, short for Numerical Python, is a powerful library used in AI, machine learning, and data science for numerical computing. It addresses the limitations of Python's built-in lists for array work, supporting multi-dimensional arrays and offering performance benefits through optimized C code. NumPy is foundational to many other libraries, such as Pandas and scikit-learn, making it a crucial tool for scientific and data-intensive computations.
19
1
25
Article
Hacker News·1y
Hey, wait – is employee performance really Gaussian distributed??
Employee performance is likely Pareto-distributed rather than Gaussian, which highlights flaws in traditional performance management processes. The Pareto assumption suggests there is no statistical basis for annually firing the bottom 10% of the workforce, as low performers are more common and hiring errors should be treated as outliers. Performance management systems need updates including improved monitoring, cost analysis, and long-term perspectives.
18
6
