Best of Data Processing — 2025

1
Video
ByteByteGo·1y
What Is the Most Popular Open-Source AI Stack?
Open-source AI lets developers experiment and build without proprietary restrictions, using frameworks and tools like Next.js, Streamlit, Gradio, and FastAPI. The data layer involves retrieval-augmented generation (RAG), vector databases, and tools for diverse file formats. The back end includes FastAPI, LangChain, Metaflow, and Ollama, which support scalable AI operations. The ecosystem also includes community-driven models from Hugging Face and LLMs like Mistral and DeepSeek.
275
2
Article
Tinybird·1y
Local first.
Tinybird introduces Tinybird Local, a Docker container that allows developers to run a full instance of Tinybird's data processing platform on their laptops. This local-first approach enables development, testing, and deployment of data applications both locally and in the cloud seamlessly. The container includes core Tinybird functionalities and several optimizations for performance but lacks some cloud-specific features. The initiative aims to provide a more controlled, offline, and versatile development environment.
175
3
Article
freeCodeCamp·46w
How to Transform JSON Data to Match Any Schema
Learn how to transform JSON data to match specific schemas using two approaches: pure Python and pandas. The tutorial covers loading JSON files, defining target schemas, cleaning and renaming fields, and validating the output. It demonstrates transforming customer records by removing unwanted fields and renaming others, while comparing performance between pure Python (faster for simple tasks) and pandas (better for complex datasets with built-in data cleaning methods).
83
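The pure-Python approach described in the entry above can be sketched roughly as follows. This is a minimal illustration, not the tutorial's code: the field names, the `FIELD_MAP` mapping, and the helper functions are all hypothetical.

```python
import json

# Hypothetical target schema: source field name -> target field name.
# Any source field not listed here is dropped.
FIELD_MAP = {"id": "customer_id", "name": "full_name", "email": "email"}


def transform_record(record, field_map=FIELD_MAP):
    """Keep only the mapped fields and rename them to their target names."""
    return {
        target: record[source]
        for source, target in field_map.items()
        if source in record
    }


def validate_record(record, field_map=FIELD_MAP):
    """Return True when every target field is present in the record."""
    return all(target in record for target in field_map.values())


# Made-up sample input standing in for a loaded JSON file.
raw = json.loads(
    '[{"id": 1, "name": "Ada", "email": "ada@example.com", "internal_notes": "x"}]'
)
transformed = [transform_record(r) for r in raw]
```

A pandas version would typically do the same with column selection and `rename`, which pays off once the cleaning rules grow beyond simple drops and renames.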
4
Article
Salesforce Engineering·1y
How a New AI Architecture Processes 100 Million Rows in 5 Minutes
Salesforce developed a new AI-driven architecture to process over 100 million rows of advertising data in just five minutes. The Marketing Intelligence product unifies ad data from numerous sources, automates campaign performance insights, and simplifies complex data processing. By integrating with Salesforce-native technologies like Data Cloud, Agentforce, and Tableau, the system scales metadata and data processing for large volumes while maintaining low latency and high performance.
74
5
Video
YouTube·1y
the Spring Boot end-to-end tutorial (new for 2025!)
Explore the essential concepts of Spring Boot 3.4 and its 2025 updates, including auto-configuration, dependency injection, and aspect-oriented programming. Learn to set up a Spring Cloud Config Server and use Spring Batch for efficient data processing. Follow along to build a dog adoption service with these tools.
70
6
Article
Netflix TechBlog·1y
Behind the Scenes: Building a Robust Ads Event Processing Pipeline
Netflix developed a robust ads event processing pipeline to enhance digital advertising strategies. The system includes components for ad insertion, tracking, and real-time feedback to optimize ad delivery and ensure accurate reporting. Netflix's approach addresses scalability and integration with third-party vendors, leveraging technologies like Apache Kafka and Flink for data processing. The evolution into an in-house advertising platform refines capabilities like frequency capping and sessionization, improving reporting and metrics, and supporting future ad types and strategies.
63
7
Article
ByteByteGo·1y
How Netflix Built a Distributed Counter for Billions of User Interactions
Netflix uses a Distributed Counter Abstraction to efficiently track billions of user interactions. This system addresses the need for low latency, high throughput, and cost efficiency by utilizing different counting techniques tailored to various use cases. The architecture employs a hybrid approach combining event logging, background aggregation, and caching. Key benefits include scalability, reliability, and balancing trade-offs between immediacy and consistency.
62
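The hybrid of event logging, background aggregation, and caching mentioned in the entry above can be illustrated with a toy, single-process sketch. It is not Netflix's implementation; the class and method names are hypothetical, and the "background" step is called by hand to keep the example deterministic.

```python
class EventualCounter:
    """Toy counter: writes append events, reads return a cached total."""

    def __init__(self):
        self._events = []          # event log of (counter_name, delta)
        self._cache = {}           # totals as of the last aggregation
        self._aggregated_upto = 0  # index of the first event not yet folded in

    def add(self, name, delta=1):
        # Writes stay cheap: they only append to the log.
        self._events.append((name, delta))

    def aggregate(self):
        # Stand-in for the background job: fold new events into the cache.
        for name, delta in self._events[self._aggregated_upto:]:
            self._cache[name] = self._cache.get(name, 0) + delta
        self._aggregated_upto = len(self._events)

    def get(self, name):
        # Reads hit the cache, so they may lag behind recent writes.
        return self._cache.get(name, 0)
```

The lag between `add` and the next `aggregate` is the immediacy-versus-consistency trade-off the summary refers to.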
8
Article
The Apache Software Foundation Blog·25w
The Apache Software Foundation Announces New Top-Level Projects
Apache Artemis and Apache Wayang have graduated to Top-Level Projects at the Apache Software Foundation. Artemis is a high-performance messaging platform supporting AMQP, MQTT, and STOMP protocols for microservices and cloud-native applications. Wayang is a unifying data processing framework with a cross-platform optimizer that integrates systems like Apache Flink, Apache Spark, and TensorFlow through a three-layer architecture.
35
1
9
Article
The Dev Craft·37w
How to Convert JSON to Excel in Seconds
Xcel is a tool that converts JSON files to Excel format instantly. Users can upload or paste JSON data, and the tool automatically transforms even complex nested structures into clean Excel spreadsheets without manual formatting or parsing. The tool is particularly useful for reporting, debugging, and sharing data with non-technical team members.
26
4
10
Article
Ars Technica·48w
Anthropic destroyed millions of print books to build its AI models
Anthropic physically destroyed millions of print books by cutting them from their bindings and scanning them to create training data for Claude AI. The company hired Google Books' former partnerships head to lead this massive digitization effort. A federal judge ruled this destructive scanning process constituted fair use because Anthropic legally purchased the books, destroyed each physical copy after scanning, and kept digital files internal rather than distributing them. The ruling establishes important precedent for AI training data acquisition methods.
21
6
11
Article
Open Source·1y
Pyper - Concurrent Python Made Simple
Pyper is a flexible, pure-Python framework designed for concurrent and parallel data processing. It features an intuitive API that unifies threaded, multiprocessed, and asynchronous work using functional programming principles. Pyper ensures safety by managing underlying task execution and resource clean-up, and it is optimized for efficiency with lazy execution through queues, workers, and generators.
19
12
Article
Hacker News·46w
Overthinking GIS
A developer creates a custom terrain usability metric by processing USGS elevation data using Python and OpenCV. The approach involves calculating the Laplacian (second-order derivative) of elevation data to identify steep terrain, then using a sliding window to generate average steepness values for different areas. The solution effectively becomes a complex downsampling method to classify land as buildable or too steep based on topographic line density.
19
1
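The Laplacian-plus-window idea from the entry above can be sketched in plain Python. The original post uses OpenCV; this version swaps in a hand-written 4-neighbour Laplacian, and the window size, stride (equal to the window size), threshold, and function names are all assumptions made for illustration.

```python
def laplacian(grid):
    """Discrete 4-neighbour Laplacian; border cells are left at 0."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = (
                grid[r - 1][c] + grid[r + 1][c]
                + grid[r][c - 1] + grid[r][c + 1]
                - 4 * grid[r][c]
            )
    return out


def window_means(grid, size):
    """Mean absolute value over size x size windows, moving in steps of size."""
    rows, cols = len(grid), len(grid[0])
    result = []
    for r in range(0, rows - size + 1, size):
        row = []
        for c in range(0, cols - size + 1, size):
            cells = [
                abs(grid[i][j])
                for i in range(r, r + size)
                for j in range(c, c + size)
            ]
            row.append(sum(cells) / len(cells))
        result.append(row)
    return result


def classify(elevation, size, threshold):
    """Label each window 'too steep' or 'buildable' using a hypothetical threshold."""
    means = window_means(laplacian(elevation), size)
    return [
        ["too steep" if m > threshold else "buildable" for m in row]
        for row in means
    ]
```

Because each window collapses to one label, the output grid is smaller than the input, which is why the summary calls the method a form of downsampling.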
13
Article
InfoWorld·1y
MarkItDown: Microsoft’s open-source tool for Markdown conversion
Microsoft has introduced MarkItDown, an open-source Python utility that converts various file formats into Markdown. The tool is designed to help with fine-tuning large language models (LLMs) and building retrieval-augmented generation (RAG) systems. MarkItDown preserves document structures, supports multi-modal data like images and audio files, and integrates with LLMs for enhanced functionality. Despite some limitations, it addresses key challenges in document processing and offers a modular and extensible architecture for developers.
18
14
Article
Data Engineering·1y
Pyper: Concurrent Python Made Simple
Pyper is a new Python package designed for concurrent and parallel data processing. It features an intuitive API, supports a functional programming paradigm, ensures safety by handling memory and thread-level errors, and is highly efficient with lazy execution. It is a pure Python package with zero dependencies.
17
15
Video
ByteByteGo·1y
Why is Kafka FAST? Part 1
Kafka achieves high throughput mainly through its reliance on sequential I/O, which is faster than random access, especially on hard drives. Storing data in append-only logs allows efficient data movement, while the low cost of hard disks makes long-term message retention practical.
13
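To make the append-only log idea from the entry above concrete, here is a toy sketch. It is not Kafka's on-disk format: the 4-byte length prefix, the class name, and the methods are hypothetical. The point is only that writes always go to the end of the file and reads scan forward from an offset.

```python
import os


class AppendOnlyLog:
    """Toy log file: records are length-prefixed and only ever appended."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()  # create the file if it does not exist

    def append(self, payload: bytes) -> int:
        """Append one record and return the byte offset where it starts."""
        offset = os.path.getsize(self.path)
        with open(self.path, "ab") as f:
            f.write(len(payload).to_bytes(4, "big") + payload)
        return offset

    def read_from(self, offset: int = 0):
        """Yield (offset, payload) pairs, scanning forward from offset."""
        with open(self.path, "rb") as f:
            f.seek(offset)
            while True:
                header = f.read(4)
                if len(header) < 4:
                    return  # reached the end of the log
                size = int.from_bytes(header, "big")
                yield offset, f.read(size)
                offset += 4 + size
```

A consumer only needs to remember the last offset it processed and can resume with `read_from(offset)`, so existing records are never rewritten or moved.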
16
Article
Collections·1y
Understanding Apache Kafka: Basics and Key Features
Apache Kafka is a distributed event-streaming platform designed for real-time data processing. It manages data flow efficiently in event-driven systems with components like topics, partitions, producers, consumers, and brokers. Kafka ensures high availability through data replication and a leader-follower model. Its architecture supports data persistence and parallel processing via consumer groups. The recent introduction of Kafka Raft (KRaft) aims to simplify cluster management.
13
17
Article
The New Stack·1y
DuckDB: Query Processing Is King
DuckDB is an in-process database that simplifies query processing without focusing on data persistence. It supports multiple programming languages and is efficient for testing scenarios and on-the-fly data transformations. DuckDB is especially useful for gaining SQL query support without the need for a full database system.
13
18
Video
ByteGrad·1y
NEW RAG-App Stack Beats Previous LLM-Stack (AI-Chatbots, OpenAI File Search, ScraperAPI)
Learn how to enhance an AI model by integrating a chatbot with web scraping and data processing tools. The process involves using ScraperAPI to collect and clean website data, then leveraging OpenAI's file search and response generation capabilities. This approach ensures the chatbot can provide accurate information based on the content of the website, reducing manual intervention and improving response quality.
11
19
Article
Community Picks·1y
nuclio/nuclio: High-Performance Serverless event and data processing platform
Nuclio is a high-performance serverless framework designed for data-, I/O-, and compute-intensive workloads. It integrates with popular data science tools like Jupyter and Kubeflow and supports various data and streaming sources, as well as execution over CPUs and GPUs. Nuclio can be used standalone in a Docker container or on top of Kubernetes. It features rapid processing capabilities and high security, with use cases in both startups and enterprises.
11
20
Article
Niklas Heidloff·1y
Unstructured Data Preparation for Generative AI
IBM's Data Prep Kit is an open-source tool for generative AI data preparation, supporting tasks like fine-tuning and retrieval-augmented generation (RAG). It helps AI developers cleanse, transform, and enrich unstructured data using Python, Ray, and Spark runtimes. The kit can handle natural language and code data, and can scale from local machines to data centers. It includes various transforms and example notebooks that guide users through data conversion, de-duplication, PII identification, and more.
10
21
Article
Real Python·1y
Working With Python Polars – Real Python
Polars is a high-performance DataFrame library for Python, designed for efficient data processing and handling large datasets. The video course introduces Polars' core features including DataFrames, expressions, contexts, reading data, grouping, aggregating, and utilizing the lazy API. The course includes 7 lessons, video subtitles, transcripts, downloadable resources, an accompanying text-based tutorial, a Q&A with Python experts, and a certificate of completion.
10
