Best of Data Analysis — June 2024
- 2
System Design Codex·2y
3 Types of Event Patterns in EDA
Event-Driven Architecture (EDA) revolves around components sending and receiving events to communicate. There are three primary event patterns: Event Notifications, which inform other components of an occurrence with minimal data; Event-Based State Transfer, where events containing necessary information are pushed to consuming components; and Event Sourcing, which involves storing and replaying events to reconstruct entity states. Each pattern offers unique advantages for different scenarios.
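To make the distinction concrete, here is a minimal Python sketch of the three patterns. The event names, fields, and the `replay` helper are hypothetical illustrations, not taken from the linked post: a notification carries only an identifier, a state-transfer event carries the data consumers need, and event sourcing rebuilds state by replaying a stored log.

```python
from dataclasses import dataclass

# Event Notification: minimal data, consumers call back for details (hypothetical fields).
notification = {"type": "OrderPlaced", "order_id": 42}

# Event-Based State Transfer: the event carries the data consumers need.
state_transfer = {"type": "OrderPlaced", "order_id": 42,
                  "items": ["book"], "total_cents": 1999}


@dataclass(frozen=True)
class Event:
    kind: str    # "deposited" or "withdrawn" (hypothetical event names)
    amount: int


def replay(events):
    """Event Sourcing: rebuild an account balance by applying stored events in order."""
    balance = 0
    for event in events:
        if event.kind == "deposited":
            balance += event.amount
        elif event.kind == "withdrawn":
            balance -= event.amount
        else:
            raise ValueError(f"unknown event kind: {event.kind}")
    return balance


event_log = [Event("deposited", 100), Event("withdrawn", 30), Event("deposited", 5)]
current_balance = replay(event_log)        # state after all events
balance_after_two = replay(event_log[:2])  # state at an earlier point in the log
```

Because the log is the source of truth, replaying a prefix of it yields the state at any earlier point in time.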
- 4
Code Like A Girl·2y
SQL Essentials: GROUP BY vs. PARTITION BY explained
Understanding the differences between GROUP BY and PARTITION BY clauses in SQL is crucial for efficient data analysis. GROUP BY is used to summarize data by grouping rows that have the same values in specified columns, while PARTITION BY is used for detailed calculations within specific partitions. GROUP BY can reduce the number of rows by summarizing data, whereas PARTITION BY adds additional information without reducing rows. Both clauses support aggregate functions, but PARTITION BY also supports ranking and time-series functions.
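The row-count difference can be illustrated with a small sketch using Python's built-in `sqlite3` module. The `sales` table and its values are made up for illustration, and the window-function query assumes the bundled SQLite is version 3.25 or newer.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
conn.executemany(
    "INSERT INTO sales VALUES (?, ?)",
    [("north", 10), ("north", 30), ("south", 20)],
)

# GROUP BY collapses each region into a single summary row.
grouped = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()

# PARTITION BY keeps every row and attaches the per-region total to each one.
windowed = conn.execute(
    "SELECT region, amount, SUM(amount) OVER (PARTITION BY region) AS region_total "
    "FROM sales ORDER BY region, amount"
).fetchall()

conn.close()
```

Here `grouped` has one row per region, while `windowed` has one row per original sale with the regional total repeated alongside it.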
- 5
Engineering Enablement·2y
Measuring Developer Experience at Google
Google's Engineering Satisfaction (EngSat) survey, in use since 2018, measures developer productivity by combining survey insights with system-based metrics. The survey is conducted quarterly and has adapted over time to include consistent staffing, effective processes, and robust infrastructure. EngSat has helped Google track productivity changes, address technical debt, and validate new metrics. The program faces challenges such as increasing survey length and decreasing response rates, which are managed through strategic sampling and transparency in reporting. Google's approach offers valuable advice for other organizations wanting to implement developer surveys.
- 6
KDnuggets·2y
Why You Should Learn SQL in 2024
Learning SQL is crucial in 2024 because it remains a highly in-demand skill for data professionals, enabling efficient data management and analysis. SQL's readability, standardization, and integration with other tools like Python and R make it an invaluable asset in any data-centric environment. Mastering SQL can significantly enhance one's ability to handle large datasets, perform complex queries, and interact with various database systems.
- 7
Towards Data Science·2y
Exploratory Data Analysis in 11 Steps
Exploratory Data Analysis (EDA) involves a structured process that starts with stakeholder communication to identify objectives, followed by defining analysis goals and research questions. Analysts should review existing knowledge, assess data accessibility, clean and transform data, and use summary statistics to understand data patterns. Key findings should be documented as the analysis progresses and shared appropriately with stakeholders.
- 8
JetBrains·2y
How to Move From pandas to Polars
Polars, written in Rust and built on Apache Arrow, is gaining popularity in the data science community for its speed and security benefits. It offers an API similar to that of pandas, which lowers the barrier to migration, and it handles large datasets more efficiently thanks to its lazy API and better concurrency. Tools like PyCharm support Polars, smoothing the transition. The post outlines the main syntax differences and offers migration tips for a relatively smooth switch from pandas to Polars.
- 9
Medium·2y
Forecasting Gold Prices with TimeGPT
This post explores how TimeGPT, a time series LLM, can be used with gold price data to forecast future prices. It covers retrieving and preprocessing the gold price data, setting up TimeGPT, and interpreting the forecasted prices and confidence intervals.
- 10
Hacker News·2y
goldmansachs/gs-quant: Python toolkit for quantitative finance
GS Quant is a Python toolkit developed by Goldman Sachs for quantitative finance, facilitating the development of trading strategies, derivative structuring, and risk management solutions. It leverages 25 years of experience in global markets and includes statistical packages for data analytics applications. It requires Python 3.6 or greater and can be installed via pip.
- 11
NVIDIA Developer·2y
Machine Learning – What Is It and Why Does It Matter?
Many industries use data science and machine learning to recognize patterns, detect changes, and make predictions to enhance their operations. The availability of open-source tools has facilitated this trend since the mid-2000s. Today, improvements in predictive models can result in significant financial gains. However, training these models requires significant computational resources, and GPUs offer a way to keep scaling now that CPU performance gains are constrained by the limits of Moore's law.
- 12
Product Hunt·2y
SQL Workbench - In-browser SQL Workbench for data querying & visualization
SQL Workbench, launched on June 19th, 2024, offers an in-browser solution for data querying and visualization. Featured under Developer Tools and Data & Analytics, this is the tool's first release. It is aimed at users who want a browser-based SQL workbench for managing their data efficiently.
- 16
Towards Data Science·2y
From Code to Insights: Software Engineering Best Practices for Data Analysts
This post provides software engineering best practices for data analysts. It covers key lessons such as writing readable code, automating repetitive tasks, mastering tools, managing environments, optimizing program performance, applying the DRY principle, leveraging testing, using version control systems, seeking code reviews, and staying up to date.
- 17
Hacker News·2y
My thoughts on Python in Excel
Python in Excel is an alternative to the Excel formula language and has use cases for computationally intensive tasks, AI, advanced visualizations, and time-series analysis. However, it is not suitable for beginners or interactive data analysis. There are also restrictions such as not being able to use custom packages or connect to web APIs.
- 18
Towards AI·2y
A Data Analysis Project — Smart Phones Data Analysis.
A data analysis project on smartphone data that extracts insights on brands, models, prices, ratings, 5G capability, IR blasters, processor brands, cores, battery capacity, RAM capacity, screen size, operating systems, resolution, refresh rate, and more.
- 20
Grafana Labs·2y
5 useful transformations you should know to get the most out of Grafana
Discover five useful transformations in Grafana that can help you better understand your data, including grouping data, organizing fields by name, filtering data by value, sorting data, and partitioning data by values.
- 21
Daily Dose of Data Science | Avi Chawla | Substack·2y
Even Two Outliers Can Distort Your Data Analysis
Outliers can significantly distort the results of data analysis, such as correlation and regression fits, leading to misleading conclusions. Visualizing the data with plots such as a pair plot is crucial for identifying these outliers and validating statistical measures. Manual code reviews are often inefficient, but tools like Sourcery leverage AI to provide instant, human-like code reviews, significantly speeding up the process.
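The effect on correlation is easy to reproduce. The following sketch uses made-up data (not from the linked post) and a hand-written Pearson correlation to show how two extreme points can turn an essentially uncorrelated sample into an apparently strong relationship.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


# Ten illustrative points with almost no linear relationship.
xs = list(range(10))
ys = [5, 3, 6, 2, 7, 4, 6, 3, 5, 4]
r_clean = pearson(xs, ys)  # close to zero

# The same data plus two far-away points lying on the line y = x.
r_outliers = pearson(xs + [30, 35], ys + [30, 35])  # above 0.9
```

A scatter plot of the second dataset would make the two points stand out immediately, which is why plotting the data before trusting the coefficient matters.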
- 22
Towards Data Science·2y
Back to Basics: Databases, SQL, and Other Data-Processing Must-Reads
Relational databases and SQL queries remain vital for daily workflows of data professionals, despite the buzz around LLMs. This post highlights essential reads on maintaining and growing skills in data and ML tasks, emphasizing the interconnectedness of foundational data operations and advanced AI tasks. Featured topics include simplifying Python code for data engineering, learning SQL for data analytics, using pivot tables in SQL, managing Excel charts with VBA, and turning relational databases into graph databases.
- 23
MotherDuck·2y
All-in-SQL Hybrid Search in DuckDB: Integrating Full Text and Embedding Methods
This post explores integrating Full Text Search (FTS) and Embedding Search to create a Hybrid Search system in DuckDB. It details the methods and SQL implementations used to combine these search techniques, focusing on the need for exact keyword matching and semantic understanding. The post also covers how to rank documents using Reciprocal Ranked Fusion and Convex Combination metrics, providing examples using the Kaggle Movies dataset.
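As a rough illustration of the two fusion metrics, here is a pure-Python sketch rather than the post's SQL implementation. It uses the usual formulation of reciprocal rank fusion, where each document scores the sum of 1 / (k + rank) over the input rankings, and a convex combination of min-max normalized scores. The function names, the constant `k=60`, the `alpha` weight, and the sample scores are all assumptions made for this example.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse ranked lists of document ids; k=60 is a commonly used default (assumption)."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Highest fused score first; ties broken by document id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


def min_max(scores):
    """Scale a non-empty {doc_id: score} mapping into the range [0, 1]."""
    lo, hi = min(scores.values()), max(scores.values())
    if hi == lo:
        return {d: 0.0 for d in scores}
    return {d: (s - lo) / (hi - lo) for d, s in scores.items()}


def convex_combination(fts_scores, emb_scores, alpha=0.5):
    """Rank by alpha * embedding score + (1 - alpha) * full-text score."""
    fts_n, emb_n = min_max(fts_scores), min_max(emb_scores)
    docs = set(fts_n) | set(emb_n)
    fused = {d: alpha * emb_n.get(d, 0.0) + (1 - alpha) * fts_n.get(d, 0.0)
             for d in docs}
    return sorted(fused, key=lambda d: (-fused[d], d))


# Hypothetical results for three documents from each search method.
rrf_order = reciprocal_rank_fusion([["a", "b", "c"], ["b", "c", "a"]])
fts = {"a": 3.0, "b": 2.0, "c": 1.0}   # e.g. keyword-match scores
emb = {"a": 0.2, "b": 0.9, "c": 0.6}   # e.g. cosine similarities
cc_order = convex_combination(fts, emb, alpha=0.5)
```

Rank fusion needs only the orderings, so it sidesteps the question of how to compare raw scores from different methods, while the convex combination uses the score magnitudes and lets `alpha` shift the weight between keyword and semantic matching.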