Best of Data Processing 2025

  1.
    Video
    ByteByteGo · 1y

    What Is the Most Popular Open-Source AI Stack?

    Open-source AI offers the freedom to experiment and build without proprietary restrictions, using frameworks and tools like Next.js, Streamlit, Gradio, and FastAPI. The data layer involves retrieval-augmented generation (RAG), vector databases, and tools for diverse file formats. The back end includes FastAPI, LangChain, Metaflow, and Ollama, facilitating scalable AI operations. The ecosystem also includes community-driven models from Hugging Face and LLMs like Mistral and DeepSeek.

  2.
    Article
    Tinybird · 1y

    Local first.

    Tinybird introduces Tinybird Local, a Docker container that allows developers to run a full instance of Tinybird's data processing platform on their laptops. This local-first approach enables development, testing, and deployment of data applications both locally and in the cloud seamlessly. The container includes core Tinybird functionalities and several optimizations for performance but lacks some cloud-specific features. The initiative aims to provide a more controlled, offline, and versatile development environment.

  3.
    Article
    freeCodeCamp · 46w

    How to Transform JSON Data to Match Any Schema

    Learn how to transform JSON data to match specific schemas using two approaches: pure Python and pandas. The tutorial covers loading JSON files, defining target schemas, cleaning and renaming fields, and validating the output. It demonstrates transforming customer records by removing unwanted fields and renaming others, while comparing performance between pure Python (faster for simple tasks) and pandas (better for complex datasets with built-in data cleaning methods).
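
    The pure-Python approach described above can be sketched roughly as follows. This is a minimal illustration, not the tutorial's own code: the field names, the `FIELD_MAP` mapping, and the `transform` and `validate` helpers are all hypothetical.

```python
import json

# Hypothetical target schema: source field name -> target field name.
# Source fields that are not listed here are dropped.
FIELD_MAP = {"id": "customer_id", "name": "full_name", "email": "email"}


def transform(record: dict) -> dict:
    """Keep only mapped fields, rename them, and strip string values."""
    out = {}
    for src, dst in FIELD_MAP.items():
        value = record.get(src)
        if isinstance(value, str):
            value = value.strip()
        out[dst] = value
    return out


def validate(record: dict) -> bool:
    """A record is valid when every target field is present and not None."""
    return set(record) == set(FIELD_MAP.values()) and all(
        v is not None for v in record.values()
    )


# Made-up sample input standing in for a loaded JSON file.
raw = json.loads(
    '[{"id": 1, "name": " Ada Lovelace ", "email": "ada@example.com", "tmp": "x"},'
    ' {"id": 2, "name": "Alan Turing"}]'
)
transformed = [transform(r) for r in raw]
valid = [r for r in transformed if validate(r)]
```

    Here the second record lacks an email, so it is transformed but filtered out by `validate`.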

  4.
    Article
    Salesforce Engineering · 1y

    How a New AI Architecture Processes 100 Million Rows in 5 Minutes

    Salesforce developed a new AI-driven architecture to process over 100 million rows of advertising data in just five minutes. The Marketing Intelligence product unifies ad data from numerous sources, automates campaign performance insights, and simplifies complex data processing. By integrating with Salesforce-native technologies like Data Cloud, Agentforce, and Tableau, the system scales metadata and data processing for large volumes while maintaining low latency and high performance.

  5.
    Video
    YouTube · 1y

    the Spring Boot end-to-end tutorial (new for 2025!)

    Explore the essential concepts of Spring Boot 3.4 and its 2025 updates, including auto-configuration, dependency injection, and aspect-oriented programming. Learn to set up a Spring Cloud Config Server and use Spring Batch for efficient data processing. Follow along to build a dog adoption service with these tools.

  6.
    Article
    Netflix TechBlog · 1y

    Behind the Scenes: Building a Robust Ads Event Processing Pipeline

    Netflix developed a robust ads event processing pipeline to enhance digital advertising strategies. The system includes components for ad insertion, tracking, and real-time feedback to optimize ad delivery and ensure accurate reporting. Netflix's approach addresses scalability and integration with third-party vendors, leveraging technologies like Apache Kafka and Flink for data processing. The evolution into an in-house advertising platform refines capabilities like frequency capping and sessionization, improving reporting and metrics, and supporting future ad types and strategies.

  7.
    Article
    ByteByteGo · 1y

    How Netflix Built a Distributed Counter for Billions of User Interactions

    Netflix uses a Distributed Counter Abstraction to efficiently track billions of user interactions. This system addresses the need for low latency, high throughput, and cost efficiency by utilizing different counting techniques tailored to various use cases. The architecture employs a hybrid approach combining event logging, background aggregation, and caching. Key benefits include scalability, reliability, and balancing trade-offs between immediacy and consistency.
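
    The hybrid of event logging, background aggregation, and caching can be pictured with a single-process sketch like the one below. It is only an illustration of the trade-off between immediacy and consistency; the `EventualCounter` class and its method names are hypothetical and do not reflect Netflix's actual implementation.

```python
from collections import defaultdict


class EventualCounter:
    """Single-process sketch: log events, aggregate later, read from a cache."""

    def __init__(self):
        self.event_log = []                # append-only list of (counter, delta)
        self.rolled_up = defaultdict(int)  # cached aggregated totals
        self.cursor = 0                    # index of the first unaggregated event

    def add(self, counter: str, delta: int = 1) -> None:
        # Writes are cheap: they only record the event.
        self.event_log.append((counter, delta))

    def aggregate(self) -> None:
        # A real system would run this in the background on a schedule.
        for counter, delta in self.event_log[self.cursor:]:
            self.rolled_up[counter] += delta
        self.cursor = len(self.event_log)

    def get(self, counter: str) -> int:
        # Fast cached read; it may lag behind the most recent writes.
        return self.rolled_up.get(counter, 0)


counter = EventualCounter()
counter.add("plays")
counter.add("plays", 2)
stale = counter.get("plays")   # read before aggregation has run
counter.aggregate()
fresh = counter.get("plays")   # read after aggregation
```

    The read before `aggregate()` returns the stale cached value, which is the consistency cost paid for cheap writes and fast reads.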

  8.
    Article
    The Apache Software Foundation Blog · 25w

    The Apache Software Foundation Announces New Top-Level Projects

    Apache Artemis and Apache Wayang have graduated to Top-Level Projects at the Apache Software Foundation. Artemis is a high-performance messaging platform supporting AMQP, MQTT, and STOMP protocols for microservices and cloud-native applications. Wayang is a unifying data processing framework with a cross-platform optimizer that integrates systems like Apache Flink, Apache Spark, and TensorFlow through a three-layer architecture.

  9.
    Article
    The Dev Craft · 37w

    How to Convert JSON to Excel in Seconds

    Xcel is a tool that converts JSON files to Excel format instantly. Users can upload or paste JSON data, and the tool automatically transforms even complex nested structures into clean Excel spreadsheets without manual formatting or parsing. The tool is particularly useful for reporting, debugging, and sharing data with non-technical team members.

  10.
    Article
    Ars Technica · 48w

    Anthropic destroyed millions of print books to build its AI models

    Anthropic physically destroyed millions of print books by cutting them from their bindings and scanning them to create training data for Claude AI. The company hired Google Books' former partnerships head to lead this massive digitization effort. A federal judge ruled this destructive scanning process constituted fair use because Anthropic legally purchased the books, destroyed each physical copy after scanning, and kept digital files internal rather than distributing them. The ruling establishes important precedent for AI training data acquisition methods.

  11.
    Article
    Open Source · 1y

    Pyper - Concurrent Python Made Simple

    Pyper is a flexible, pure-Python framework designed for concurrent and parallel data processing. It features an intuitive API that unifies threaded, multiprocessed, and asynchronous work using functional programming principles. Pyper ensures safety by managing underlying task execution and resource clean-up, and it is optimized for efficiency with lazy execution through queues, workers, and generators.

  12.
    Article
    Hacker News · 46w

    Overthinking GIS

    A developer creates a custom terrain usability metric by processing USGS elevation data using Python and OpenCV. The approach involves calculating the Laplacian (second-order derivative) of elevation data to identify steep terrain, then using a sliding window to generate average steepness values for different areas. The solution effectively becomes a complex downsampling method to classify land as buildable or too steep based on topographic line density.
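
    The two steps, a Laplacian followed by window averaging, can be sketched without OpenCV. The version below is an assumption-laden stand-in: it uses a plain 4-neighbour discrete Laplacian and non-overlapping windows on a made-up grid, and the `laplacian` and `block_mean_abs` helpers and the threshold are hypothetical rather than taken from the article.

```python
def laplacian(grid):
    """4-neighbour discrete Laplacian for interior cells; border cells stay 0."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = (grid[r - 1][c] + grid[r + 1][c]
                         + grid[r][c - 1] + grid[r][c + 1]
                         - 4 * grid[r][c])
    return out


def block_mean_abs(values, size):
    """Average |value| over non-overlapping size x size windows (downsampling)."""
    rows, cols = len(values), len(values[0])
    result = []
    for r0 in range(0, rows - size + 1, size):
        row = []
        for c0 in range(0, cols - size + 1, size):
            cells = [abs(values[r][c])
                     for r in range(r0, r0 + size)
                     for c in range(c0, c0 + size)]
            row.append(sum(cells) / len(cells))
        result.append(row)
    return result


# Made-up 4x4 elevation grid: flat ground with a single bump.
elevation = [[0, 0, 0, 0],
             [0, 4, 0, 0],
             [0, 0, 0, 0],
             [0, 0, 0, 0]]
scores = block_mean_abs(laplacian(elevation), size=2)

THRESHOLD = 2.0  # hypothetical cut-off between "buildable" and "too steep"
buildable = [[score < THRESHOLD for score in row] for row in scores]
```

    Each output cell summarises a 2x2 block of the input, which is why the result behaves like a downsampled map of the original grid.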

  13.
    Article
    InfoWorld · 1y

    MarkItDown: Microsoft’s open-source tool for Markdown conversion

    Microsoft has introduced MarkItDown, an open-source Python utility that converts various file formats into Markdown. The tool is designed to help with fine-tuning large language models (LLMs) and building retrieval-augmented generation (RAG) systems. MarkItDown preserves document structures, supports multi-modal data like images and audio files, and integrates with LLMs for enhanced functionality. Despite some limitations, it addresses key challenges in document processing and offers a modular and extensible architecture for developers.

  14.
    Article
    Data Engineering · 1y

    Pyper: Concurrent Python Made Simple

    Pyper is a new Python package designed for concurrent and parallel data processing. It features an intuitive API, supports a functional programming paradigm, ensures safety by handling memory and thread-level errors, and is highly efficient with lazy execution. It is a pure Python package with zero dependencies.

  15.
    Video
    ByteByteGo · 1y

    Why is Kafka FAST? Part 1

    Kafka achieves high throughput mainly due to its reliance on sequential I/O, which is faster than random access, especially on hard drives. Utilizing append-only logs for data storage allows efficient data movement, while the cost-effectiveness of hard disks enables long-term message retention.
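
    The append-only idea can be illustrated with a toy log file: every write goes to the end of the file, and readers scan forward from a byte offset. This is a simplified sketch, not Kafka's storage format; the `AppendOnlyLog` class, the length-prefixed record layout, and the file name are all hypothetical.

```python
import os
import struct
import tempfile


class AppendOnlyLog:
    """Length-prefixed records appended to one file; reads scan forward."""

    def __init__(self, path):
        self.path = path
        open(path, "ab").close()  # create the file if it does not exist

    def append(self, payload: bytes) -> int:
        """Write at the end of the file and return the record's byte offset."""
        with open(self.path, "ab") as f:
            f.seek(0, os.SEEK_END)
            offset = f.tell()
            f.write(struct.pack(">I", len(payload)) + payload)
        return offset

    def read_from(self, offset: int = 0):
        """Yield (offset, payload) pairs sequentially, starting at offset."""
        with open(self.path, "rb") as f:
            f.seek(offset)
            while True:
                pos = f.tell()
                header = f.read(4)
                if len(header) < 4:
                    return
                (size,) = struct.unpack(">I", header)
                yield pos, f.read(size)


log = AppendOnlyLog(os.path.join(tempfile.mkdtemp(), "topic-0.log"))
first = log.append(b"hello")
second = log.append(b"world")
tail = list(log.read_from(second))
```

    Because existing records are never rewritten, a consumer only needs to remember the offset it reached in order to resume reading.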

  16.
    Article
    Collections · 1y

    Understanding Apache Kafka: Basics and Key Features

    Apache Kafka is a distributed event-streaming platform designed for real-time data processing. It manages data flow efficiently in event-driven systems with components like topics, partitions, producers, consumers, and brokers. Kafka ensures high availability through data replication and a leader-follower model. Its architecture supports data persistence and parallel processing via consumer groups. The recent introduction of Kafka Raft (KRaft) aims to simplify cluster management.
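
    How partitions and consumer groups enable parallel processing can be shown with a small model: keys are hashed to partitions, and partitions are divided among the members of a group. The functions below are hypothetical and do not use a Kafka client; Kafka's default partitioner hashes keys with murmur2, and `zlib.crc32` is used here only as a stand-in.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Map a key to a partition so records with equal keys land together."""
    # crc32 is a stand-in for the hash a real Kafka client would use.
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def assign_partitions(num_partitions: int, consumers: list) -> dict:
    """Spread partitions round-robin across the members of one consumer group."""
    assignment = {consumer: [] for consumer in consumers}
    for partition in range(num_partitions):
        owner = consumers[partition % len(consumers)]
        assignment[owner].append(partition)
    return assignment


assignment = assign_partitions(6, ["consumer-a", "consumer-b"])
same_partition = partition_for("user-1", 6) == partition_for("user-1", 6)
```

    Each partition has exactly one owner within the group, so adding consumers (up to the partition count) increases parallelism without two members reading the same partition.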

  17.
    Article
    The New Stack · 1y

    Duck DB: Query Processing Is King

    DuckDB is an in-process database that simplifies query processing without focusing on data persistence. It supports multiple programming languages and is efficient for testing scenarios and on-the-fly data transformations. DuckDB is especially useful for gaining SQL query support without the need for a full database system.

  18.
    Video
    ByteGrad · 1y

    NEW RAG-App Stack Beats Previous LLM-Stack (AI-Chatbots, OpenAI File Search, ScraperAPI)

    Learn how to enhance an AI model by integrating a chatbot with web scraping and data processing tools. The process involves using ScraperAPI to collect and clean website data, then leveraging OpenAI's file search and response generation capabilities. This approach ensures the chatbot can provide accurate information based on the content of the website, reducing manual intervention and improving response quality.

  19.
    Article
    Community Picks · 1y

    nuclio/nuclio: High-Performance Serverless event and data processing platform

    Nuclio is a high-performance serverless framework designed for data-, I/O-, and compute-intensive workloads. It integrates with popular data science tools like Jupyter and Kubeflow and supports various data and streaming sources, as well as execution over CPUs and GPUs. Nuclio can be used standalone in a Docker container or on top of Kubernetes. It features rapid processing capabilities and high security, with use cases in both startups and enterprises.

  20.
    Article
    Niklas Heidloff · 1y

    Unstructured Data Preparation for Generative AI

    IBM's Data Prep Kit is an open-source tool for generative AI data preparation, supporting tasks like fine-tuning and retrieval-augmented generation (RAG). It helps AI developers cleanse, transform, and enrich unstructured data using common Python frameworks, Ray, and Spark runtimes. The kit can handle natural language and code data, and can scale from local machines to data centers. It includes various transformers and example notebooks to guide users in data conversion, de-duplication, PII identification, and more.

  21.
    Article
    Real Python · 1y

    Working With Python Polars – Real Python

    Polars is a high-performance DataFrame library for Python, designed for efficient data processing and handling large datasets. The video course introduces Polars' core features including DataFrames, expressions, contexts, reading data, grouping, aggregating, and utilizing the lazy API. The course includes 7 lessons, video subtitles, transcripts, downloadable resources, an accompanying text-based tutorial, a Q&A with Python experts, and a certificate of completion.