Best of Data Processing — June 2024

1
Article
Community Picks·2y
Kafka Migration and Event Streaming
Apache Kafka is an open-source distributed event-streaming platform known for its scalability and high throughput. This tutorial walks through expanding a Kafka cluster by adding a new node and migrating topic partitions to balance resource utilization, using both manual scripts and Kafka Cruise Control. It also covers aggregating event data with ksqlDB, a database that runs on top of Kafka topics and is queried with SQL-like syntax. The tutorial includes Docker Compose configurations, command-line instructions, and detailed steps for setting up and verifying the expanded Kafka cluster and the ksqlDB integration.
31
2
Article
Collections·2y
Getting Started with PySpark: Efficient Data Processing for Beginners and Speeding up Machine Learning Projects
PySpark, the Python API for Apache Spark, facilitates efficient big data processing and machine learning by distributing tasks across multiple machines. It’s easy for Python users to learn and scales well from a single machine to large clusters. This overview covers installation, basic usage, and creating custom functions to enhance machine learning projects with streamlined data preparation tasks like quality checks and finding duplicates.
10
3
Article
KDnuggets·2y
What Data Scientists Should Know About OpenUSD
Discover how OpenUSD, a versatile framework, can enhance data science workflows through its unified data model, file format plugins, composability, custom pipelining, and extensibility. Learn about OpenUSD's Hydra framework and Hydra 2.0 for procedural processing, and find resources for learning OpenUSD and adding OpenUSD support to applications.
10
