Best of Data Analysis — November 2024

1
Article
Hacker News·2y
Visualizing 13 million BlueSky users
An exploration into creating a visualization of 13 million BlueSky users, leveraging force-directed graph layout techniques and UMAP for dimensionality reduction. The process involved aggregating follow and unfollow events using WebSocket on BlueSky's relay service, followed by parallelized computation on a home server to handle the vast data. The project culminated in an interactive map to explore the network and highlighted the importance of interactivity for meaningful large-scale visualizations.
400
18
2
Article
Itamar Gilad·1y
4 Levels of Data Proficiency
Data proficiency is essential for product companies to thrive. This post outlines four levels of data proficiency: business modeling, data-driven, evidence-guided, and AI-powered. Business modeling focuses on creating models to understand customer behavior and business growth. Data-driven companies prioritize data collection, processing, and consistent analysis. Evidence-guided organizations test assumptions and act on validated data. The AI-powered level is speculative, suggesting that future advancements in AI could significantly enhance data-driven decision-making and business modeling.
162
7
3
Article
Community Picks·2y
Bulletproof Typescript with Valibot
Valibot is a modular, tree-shakeable schema library for TypeScript that produces smaller bundles than similar libraries such as Zod. The post uses practical examples and design patterns to show how it supports readable, resilient, type-safe code. Valibot's functions and methods provide run-time data validation for JSON configurations, user form inputs, and server requests. The result is a consistent, predictable data transformation pipeline that improves code reliability and maintainability.
112
4
Article
Daily Dose of Data Science | Avi Chawla | Substack·1y
Pandas vs. FireDucks Performance Comparison
FireDucks is an optimized alternative to Pandas that gets much of its speed from lazy execution. Users only need to replace their Pandas import with the FireDucks one. The post's benchmarks show FireDucks outperforming Pandas as well as other libraries such as Modin and Polars. It also gives instructions for installing FireDucks, using it in Jupyter Notebook, and integrating it into existing Python scripts.
98
2
5
Video
Tech With Tim·1y
How To Make Money From Python - A Complete Guide
Learn various ways to make money with Python skills beyond traditional employment. The methods include building bots and automation tools, creating courses and content, integrating AI for businesses, engaging in algorithmic trading, developing full-stack web applications, and performing data analysis and cleaning. The guide provides practical examples and insights to get started in these niches, even if you're not an expert.
81
6
Article
Product Hunt·2y
Trench - Open source analytics infrastructure
Trench is a new open source analytics infrastructure tool that was launched on November 10th, 2024. It is designed for developers and integrates with GitHub, offering robust data and analytics capabilities. This marks the first launch of Trench.
75
1
7
Video
Community Picks·2y
15 Machine Learning Lessons I Wish I Knew Earlier
Switching to a career in machine learning or data science can be challenging. Key takeaways include understanding the importance of mastering fundamentals over trendy tools, handling imposter syndrome, emphasizing data pre-processing, understanding the business problem fully, and continuously learning and adapting to new advancements. Collaboration and communication skills are essential, as well as practical experience with real-world data projects. Networking plays a crucial role in career growth.
67
8
Article
InfoSec Write-ups·2y
Dark Web Scraping Using AI: Tools, Techniques, and Challenges
Learn how to use AI for scraping dark web data by leveraging Python and the Llama model. This guide covers setting up the necessary tools, including Streamlit, LangChain, Selenium, and BeautifulSoup, in a Python virtual environment. It demonstrates a step-by-step process to create a web scraper, retrieve and clean webpage content, and analyze the scraped data using Llama for accurate and relevant insights.
40
6
9
Article
Data Engineer Things·2y
I spent 3 hours learning the overview of ClickHouse
ClickHouse is a high-performance, column-oriented SQL OLAP system developed initially for Yandex Metrica. It supports high ingestion rates, low-latency queries, and is adaptable for various data architectures. The system's architecture includes a query processing layer with vectorized execution, a storage layer with diverse table engines, and an integration layer for extensive external connectivity. ClickHouse uses sharding and replication to handle large-scale data efficiently.
30
10
Video
Fireship·2y
Apache Spark in 100 Seconds
Apache Spark is an open-source data analytics engine designed to process massive streams of data from multiple sources at high speed by performing most tasks in memory. Created in 2009 at UC Berkeley, it is widely used in various fields, including e-commerce and space research. It supports multiple languages through APIs and can be run locally or scaled across distributed systems. Spark also has robust machine learning capabilities with its MLlib library.
29
11
Article
Towards AI·2y
This Pandas Trick Will Blow Your Mind As a Data Scientist!
Learn how to automate data analysis with Pandas through an 8-step process. The guide covers setting up your environment, uploading CSV files, and generating comprehensive reports with just one click. Essential libraries include Pandas, Numpy, Ipywidgets, Matplotlib, and Seaborn.
25
12
Article
Hacker News·2y
Stream, transform, and route PostgreSQL data in real-time.
Stream data changes in near real-time using PostgreSQL Logical Replication with pg_flo. Perform parallelizable bulk copies for fast initial synchronization, apply regex-based transformations to mask sensitive data, and route data between differently named tables or the same table with custom column mappings. Deploy easily using Docker, and use pg_flo for secure production-to-staging synchronization as well as data archiving and analytics.
25
1
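The regex-based masking mentioned in the pg_flo summary above can be illustrated with a minimal Python sketch. This is not pg_flo's configuration format or API; `mask_email`, `mask_row`, and the sample row are hypothetical names used only to show the idea of rewriting sensitive values with a regular expression before they reach a staging system.

```python
import re

# Matches an email address and captures the domain (assumed pattern, for illustration).
EMAIL_PATTERN = re.compile(r"[^@\s]+@([^@\s]+)")


def mask_email(value: str) -> str:
    """Replace the local part of each email address with asterisks, keeping the domain."""
    return EMAIL_PATTERN.sub(r"****@\1", value)


def mask_row(row: dict, columns: set) -> dict:
    """Return a copy of the row with the named string columns masked."""
    return {
        key: mask_email(val) if key in columns and isinstance(val, str) else val
        for key, val in row.items()
    }


# Hypothetical change event for a single row.
row = {"id": 7, "email": "jane.doe@example.com", "note": "ok"}
masked = mask_row(row, {"email"})
print(masked)
```

Only the listed columns are touched, so non-sensitive fields pass through unchanged.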
13
Article
A Java geek·2y
DuckDB in Action
DuckDB in Action by Mark Needham, Michael Hunger, and Michael Simons offers a detailed guide to DuckDB with a step-by-step approach. The book covers DuckDB basics, advanced SQL queries, and its integration with ecosystems like Python’s Pandas and Apache Spark. Despite being informative, the book struggles with focus, fluctuating between teaching DuckDB and general SQL learning.
24
14
Video
Oxylabs·1y
How to Scrape Google Trends Data With Python
Learn how to use Python to scrape data from Google Trends and gain insights into keyword popularity, market research, and societal trends. The guide covers the installation of necessary libraries, making API requests, handling responses, saving results to CSV, and comparing data across different keywords and regions.
21
15
Video
YouTube·2y
I Tried 50 Data Analyst Courses. Here Are Top 5
Many online courses in data analytics are lengthy and filled with fluff, but some are genuinely worthwhile. Key recommendations include the Google Analytics Certificate for web and marketing analytics, Microsoft's Power BI certification (PL-300) for data visualization and its DP-300 certification for SQL and database skills, DataCamp's hands-on training tracks (which offer discounts on certification exams), the Tableau Certified Data Analyst credential, and Harvard's comprehensive Data Science Professional Certificate. These courses offer industry-recognized credentials that validate the skills employers seek in data analysts.
19
1
16
Article
Towards AI·2y
Standard Deviation For Dummies
Standard deviation measures the amount of variation in a dataset and is closely related to variance. Variance shows how different the items in a group are, while standard deviation expresses that spread in the same units as the data, which makes it easier to interpret. To find it, calculate the variance and then take its square root. In a normal distribution, about 68% of values fall within one standard deviation of the mean and about 95% within two, which makes standard deviation a critical tool for data analysis.
18
2
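The calculation described in the standard deviation summary above (variance first, then its square root) can be sketched in a few lines of Python. The sample values are made up for illustration, and the sketch uses the population formulas (dividing by n rather than n - 1).

```python
import math
import statistics

# Hypothetical sample data, chosen only for illustration.
values = [2, 4, 4, 4, 5, 5, 7, 9]

mean = sum(values) / len(values)

# Population variance: the average squared distance from the mean.
variance = sum((x - mean) ** 2 for x in values) / len(values)

# Standard deviation: the square root of the variance, in the data's own units.
std_dev = math.sqrt(variance)

print(mean, variance, std_dev)  # prints: 5.0 4.0 2.0

# The standard library's population functions give the same figures.
assert math.isclose(variance, statistics.pvariance(values))
assert math.isclose(std_dev, statistics.pstdev(values))
```

For a sample drawn from a larger population, `statistics.variance` and `statistics.stdev` apply the n - 1 correction instead.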
17
Article
Hacker News·1y
1 dataset. 100 visualizations.
An information design agency aimed to create 100 different visualizations from a single dataset. Their goal was to demonstrate the variety and complexity possible in data visualization, highlighting how different techniques can tell unique stories with the same data.
17
18
Article
Hacker News·2y
The Polars vs pandas difference nobody is talking about
The post highlights the unique advantages of using Polars over pandas for group-by operations, particularly focusing on non-elementary group-by aggregations. While pandas struggles with efficiently performing complex group-by operations without resorting to Python lambda functions, Polars allows users to express these aggregations cleanly and efficiently. The author emphasizes the importance of API innovation and the potential limitations of strictly adhering to pandas' API design.
17
1
19
Article
Towards Data Science·2y
The Four Pillars of a Data Career
Breaking into a data career typically requires proficiency in four pillars: spreadsheets (Excel), SQL, visualization tools (Tableau or Power BI), and scripting languages (Python or R). The post suggests focusing on Excel for entry-level roles, with additional recommendations for learning SQL basics, creating standard charts in visualization tools, and understanding programming essentials in scripting languages.
17
20
Article
asayer·1y
Data Lake vs Data Warehouse: Key Differences and When to Use Each
Data lakes and data warehouses are two primary storage solutions for big data. Data lakes store raw and diverse data types, making them ideal for machine learning and extensive data analytics. Data warehouses store structured data for quick analysis and reporting, suitable for business intelligence and real-time insights. A data lakehouse combines features of both, providing flexibility and high-speed performance for a variety of data storage needs.
15
21
Article
Crunchy Data·2y
8 Steps in Writing Analytical SQL Queries
Writing complex SQL queries involves starting with simple queries and progressively adding complexity while verifying accuracy at each step. Key steps include defining desired data, investigating and sampling data, confirming simplicity, adding joins cautiously, performing summations, and rigorously debugging. SQL's power lies in its ability to utilize simple, standardized logic blocks to extract accurate data from complex structures.
14
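The incremental approach in the SQL summary above (sample first, add joins cautiously, verify at each step, then aggregate) can be sketched with Python's built-in `sqlite3` module. The `customers` and `orders` tables and their contents are hypothetical; the article itself works with PostgreSQL.

```python
import sqlite3

# In-memory database with two small, made-up tables.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE customers (id INTEGER PRIMARY KEY, region TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, amount REAL);
    INSERT INTO customers VALUES (1, 'north'), (2, 'south');
    INSERT INTO orders VALUES (1, 1, 10.0), (2, 1, 15.0), (3, 2, 20.0);
    """
)

# Step 1: sample the base table to see what the rows look like.
sample = conn.execute("SELECT * FROM orders LIMIT 5").fetchall()

# Step 2: record a baseline row count before adding any join.
base_count = conn.execute("SELECT COUNT(*) FROM orders").fetchone()[0]

# Step 3: add the join and confirm it neither drops nor duplicates rows.
joined_count = conn.execute(
    "SELECT COUNT(*) FROM orders o JOIN customers c ON c.id = o.customer_id"
).fetchone()[0]
assert joined_count == base_count, "join changed the row count"

# Step 4: only now add the aggregation.
totals = conn.execute(
    """
    SELECT c.region, SUM(o.amount)
    FROM orders o JOIN customers c ON c.id = o.customer_id
    GROUP BY c.region
    ORDER BY c.region
    """
).fetchall()
print(totals)
```

Checking the row count after each join is a cheap way to catch fan-out from one-to-many relationships before the sums are computed.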
22
Article
Machine Learning News·2y
How to Become a Data Analyst? Step by Step Guide
Data analysts play a vital role in converting raw data into meaningful insights for strategic decision-making across industries. Key responsibilities include data collection, cleaning, statistical analysis, and creating visualizations. Essential skills comprise proficiency in tools like Python, R, Tableau, and Excel, along with strong analytical and communication abilities. Building a career involves gaining practical experience through projects, internships, and participation in data-centric communities. Networking and continuous learning are also crucial for career advancement.
14
23
Video
YouTube·2y
6-week Free Data Engineering Boot Camp Launch Video | DataExpert.io
A six-week free Data Engineering Boot Camp is launching, featuring over 45 in-depth videos aimed at elevating skills in data engineering. The first two weeks cover dimensional and fact data modeling. Afterward, the boot camp splits into infrastructure and analytics tracks, addressing topics such as pipeline specs, data quality patterns, PySpark unit testing, Kafka, real-time data processing, and more. Participants can earn certificates and get hands-on assignments with AI-generated feedback, with content published daily from November 15th until the end of the year.
13
24
Article
DuckDB·2y
Analyzing Open Government Data with duckplyr
duckplyr is a high-performance, drop-in replacement for dplyr in R, powered by DuckDB. This post demonstrates how to use duckplyr to clean and analyze an open data set from New Zealand's government, showcasing the library's capabilities for efficient data wrangling and analysis. With enhanced CSV parsing and holistic optimization, duckplyr ensures faster and more ergonomic handling of large datasets compared to dplyr.
13
25
Article
Towards Data Science·2y
A Practical Framework for Data Analysis: 6 Essential Principles
Discover six key principles for effective data analysis, drawn from years of industry experience in consumer tech. These principles include establishing a baseline, normalizing metrics, applying MECE grouping, aggregating granular data, removing irrelevant data, and using the Pareto principle. These guidelines can help analysts uncover valuable business insights and improve their exploratory data analysis (EDA) practices.
12
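One of the principles listed above, the Pareto principle, lends itself to a short sketch: rank categories by contribution and keep the smallest set that covers most of the total. The function name `pareto_head`, the 80% threshold, and the revenue figures are assumptions for illustration, not taken from the article.

```python
def pareto_head(totals: dict, threshold: float = 0.8) -> list:
    """Return the largest contributors that together reach `threshold` of the total."""
    grand_total = sum(totals.values())
    running = 0.0
    head = []
    # Walk categories from largest to smallest contribution.
    for name, value in sorted(totals.items(), key=lambda kv: kv[1], reverse=True):
        head.append(name)
        running += value
        if running / grand_total >= threshold:
            break
    return head


# Hypothetical revenue by product.
revenue = {"A": 500, "B": 300, "C": 100, "D": 60, "E": 40}
top = pareto_head(revenue)
print(top)
```

Focusing the rest of an exploratory analysis on the returned categories keeps attention on the segments that drive most of the metric.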
