Best of Big Data — August 2024

1
Article
Medium·2y
How Did LinkedIn Handle 7 Trillion Messages Daily With Apache Kafka?
LinkedIn uses Apache Kafka to manage and process up to 7 trillion messages daily. They achieve reliability and scalability through a multi-tiered Kafka deployment across multiple data centers, leveraging local and aggregate clusters. LinkedIn ensures message completeness with an internal auditing tool that tracks sent and consumed messages. They maintain a close relationship with the open-source Kafka community by regularly integrating features and patches from their internal branches into the upstream Kafka branch.
175
4
2
Article
Quastor Daily·2y
How Canva Collects 25 Billion Events Per Day
Canva processes over 25 billion events daily using AWS Kinesis, benefiting from its real-time data analysis and cost-saving features. Their data pipeline batches, compresses, and enriches events before routing them to Snowflake for further analysis. The switch from AWS SQS to Kinesis reduced their costs by 85%.
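The batching and compression steps can be sketched in a few lines of Python. This is only an illustration of the general idea: the summary does not describe Canva's actual formats, so the newline-delimited JSON encoding, the gzip codec, the batch size, and every function name here are assumptions.

```python
import gzip
import json
from typing import Dict, Iterable, Iterator, List

Event = Dict[str, object]


def batch_events(events: Iterable[Event], batch_size: int) -> Iterator[List[Event]]:
    """Group events into lists of at most batch_size items."""
    batch: List[Event] = []
    for event in events:
        batch.append(event)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # emit the final, partially filled batch
        yield batch


def compress_batch(batch: List[Event]) -> bytes:
    """Encode a batch as newline-delimited JSON and gzip it (assumed format)."""
    payload = "\n".join(json.dumps(e, sort_keys=True) for e in batch)
    return gzip.compress(payload.encode("utf-8"))


def decompress_batch(blob: bytes) -> List[Event]:
    """Inverse of compress_batch, as a consumer would apply it."""
    text = gzip.decompress(blob).decode("utf-8")
    return [json.loads(line) for line in text.splitlines()]


# Hypothetical events: 250 of them become 3 records instead of 250.
events = [{"event": "page_view", "user_id": i} for i in range(250)]
batches = list(batch_events(events, batch_size=100))
records = [compress_batch(b) for b in batches]
```

Each entry in `records` stands for one payload sent to the stream, so the number of writes drops from one per event to one per batch.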
73
1
3
Article
KDnuggets·2y
Project Ideas to Master Data Engineering
To effectively learn data engineering, working on projects is essential. Key skills to focus on include data transformation, data visualization, building data pipelines, and implementing data storage solutions like data lakes and data warehouses. The post suggests six project ideas to cover these aspects: building an end-to-end data pipeline, transforming data sets, implementing a data lake, creating a data warehouse, processing real-time data, and visualizing data with dashboards.
68
4
Article
ByteByteGo·2y
Trillions of Indexes: How Uber’s LedgerStore Supports Such Massive Scale
Uber's LedgerStore is a custom-built solution to manage trillions of financial transaction records efficiently. It ensures data immutability and supports various types of indexes including strongly consistent, eventually consistent, and time-range indexes. The migration from DynamoDB to LedgerStore for Uber's payment data was driven by the need for cost savings, simplified architecture, improved performance, and tailored features for financial data management. This transition involved handling 1.2 PB of compressed data with zero data inconsistencies detected over six months.
55
3
5
Article
KDnuggets·2y
Tools Every AI Engineer Should Know: A Practical Guide
Being an AI engineer requires expertise in various tools and skills such as Python, R, big data frameworks like Hadoop and Spark, and cloud services like AWS, GCP, and Microsoft Azure. These tools are essential for building and optimizing AI systems. An AI engineer must also have solid programming knowledge, a deep understanding of machine learning, and practical experience through data projects, competitions, and open-source contributions.
54
1
6
Article
Quastor Daily·2y
How Lyft Processes Terabytes of Real Time Data
Lyft moved from Apache Druid to ClickHouse for real-time data processing, where it needs sub-second queries over large data volumes. The shift addressed issues like Druid's steep maintenance learning curve and complex infrastructure. ClickHouse offered simpler management, a gentler learning curve, data deduplication, lower costs, and specialized engines, despite some initial challenges such as query caching performance and Kafka ingestion issues.
33
1
7
Video
Community Picks·2y
A Javascript Software Engineer bought a house | PR Review [17]
17
2
8
Article
Towards Dev·2y
Spark — Beyond Basics: Hidden actions in your spark code
The post discusses hidden actions in Apache Spark: operations that look like lazy transformations but actually trigger jobs. It uses examples from Spark code snippets, such as `read.csv()`, `df.groupby().pivot()`, and `foreach()`, to explain how certain operations trigger jobs. Key insights include how the `inferSchema` option makes a seemingly lazy read behave like an action, and the unique behavior of `pivot` and `foreach`.
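Why schema inference forces work can be shown with a toy model in plain Python. This is not Spark code and none of these names come from Spark or from the post; `CountingSource`, `read_csv`, and `infer_type` are hypothetical helpers that only mimic the idea that a supplied schema needs no data pass, while an inferred one does.

```python
import csv
import io
from typing import Dict, Iterator, List, Optional

CSV_TEXT = "id,amount\n1,9.5\n2,3.0\n3,7.25\n"


class CountingSource:
    """Wraps CSV text and counts how many data rows have been read."""

    def __init__(self, text: str) -> None:
        self._text = text
        self.rows_read = 0

    def rows(self) -> Iterator[Dict[str, str]]:
        reader = csv.reader(io.StringIO(self._text))
        header = next(reader)
        for row in reader:
            self.rows_read += 1
            yield dict(zip(header, row))


def infer_type(values: List[str]) -> str:
    """Pick the narrowest type that parses every value."""
    for caster, name in ((int, "int"), (float, "double")):
        try:
            for value in values:
                caster(value)
            return name
        except ValueError:
            continue
    return "string"


def read_csv(source: CountingSource,
             schema: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Return the table schema; only inference has to touch the data."""
    if schema is not None:
        return schema  # lazy path: nothing is read
    rows = list(source.rows())  # eager path: a full pass over the data
    columns = list(rows[0].keys()) if rows else []
    return {c: infer_type([r[c] for r in rows]) for c in columns}


explicit_source = CountingSource(CSV_TEXT)
explicit_schema = read_csv(explicit_source, schema={"id": "int", "amount": "double"})

inferred_source = CountingSource(CSV_TEXT)
inferred_schema = read_csv(inferred_source)
```

With the explicit schema `explicit_source.rows_read` stays at zero, while the inferred path reads every row before any query has been written.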
15
9
Article
Daily Dose of Data Science | Avi Chawla | Substack·2y
Spark != Pandas + Big Data Support
Pandas and Spark both work with data tables, but their approaches differ significantly, mainly because Spark evaluates lazily. Unlike Pandas, Spark runs transformations only when an action is triggered. This deferred computation allows for optimization, but it can also cause redundant computation and performance bottlenecks, since each action may re-run the same chain of transformations. One common solution is to call `df.cache()` to keep the results of transformations in memory so later actions can reuse them. It's crucial to release cached memory with `df.unpersist()` once it's no longer needed. Learning Spark can also strengthen your data science skills, as it is in high demand in the industry.
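The recomputation problem and the effect of caching can be mimicked with a toy model in plain Python. This is not Spark's API: `LazyTable` and its methods are hypothetical, named after `cache()` and `unpersist()` only to mirror the idea, and a counter stands in for an expensive transformation.

```python
from typing import Callable, Dict, List, Optional


class LazyTable:
    """Toy lazily evaluated table: work happens only when an action runs."""

    def __init__(self, compute: Callable[[], List[int]]) -> None:
        self._compute = compute
        self._cache_enabled = False
        self._cached: Optional[List[int]] = None

    def map(self, fn: Callable[[int], int]) -> "LazyTable":
        # Transformation: records the step and returns immediately.
        return LazyTable(lambda: [fn(x) for x in self._collect()])

    def cache(self) -> "LazyTable":
        self._cache_enabled = True
        return self

    def unpersist(self) -> "LazyTable":
        self._cache_enabled = False
        self._cached = None  # release the stored result
        return self

    def _collect(self) -> List[int]:
        if not self._cache_enabled:
            return self._compute()  # recompute on every action
        if self._cached is None:
            self._cached = self._compute()  # first action fills the cache
        return self._cached

    def count(self) -> int:  # action
        return len(self._collect())

    def total(self) -> int:  # action
        return sum(self._collect())


calls: Dict[str, int] = {"expensive": 0}


def expensive(x: int) -> int:
    calls["expensive"] += 1
    return x * x


squared = LazyTable(lambda: [1, 2, 3]).map(expensive)
calls_before_any_action = calls["expensive"]

squared.count()
squared.total()
uncached_calls = calls["expensive"]  # both actions re-ran the transformation

calls["expensive"] = 0
squared.cache()
squared.count()
squared.total()
cached_calls = calls["expensive"]  # only the first action did the work
squared.unpersist()
```

Without caching, the two actions run the transformation twice over the three rows; with caching, the second action reuses the stored result.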
14
