Best of Data Processing 2024

  1. Article
    Community Picks · 2y

    9 Software Architecture Patterns for Distributed Systems

    In modern software development, distributed systems need deliberate design to manage data and communication between components. Nine architectural patterns (Peer-to-Peer, API Gateway, Pub-Sub, Request-Response, Event Sourcing, ETL, Batching, Streaming Processing, and Orchestration) address reliability, scalability, and maintainability. Knowing their strengths and trade-offs helps both in building robust systems and in system design interviews.

  2. Article
    Java Code Geeks · 2y

    Java Streams: 5 Powerful Techniques You Might Not Know

    Discover 5 powerful techniques for Java Streams that can enhance your code readability, maintainability, and processing efficiency.

  3. Article
    Towards AI · 2y

    The Best Practices of RAG

    This post explores the retrieval-augmented generation (RAG) process and outlines best practices for its components. It discusses query classification, efficient document retrieval, re-ranking for relevance, re-packing into structured formats, and summarization to extract key information. It also evaluates these practices and concludes with insights and recommendations.

  4. Article
    Data Engineer Things · 2y

    I spent 8 hours learning the details of the Apache Spark scheduling process.

    The post delves into the details of the Apache Spark scheduling process. It covers the anatomy of a Spark job, stages, tasks, and the Directed Acyclic Graph (DAG) scheduler. It explains how SparkContext initiates scheduling, the roles of TaskScheduler and SchedulerBackend, and the concept of data locality in task execution. The post also discusses speculative execution to handle slow tasks and the entire end-to-end scheduling process in Spark.

  5. Article
    Tinybird · 2y

    Best practices for timestamps and time zones in databases

    The post provides best practices for managing timestamps and time zones in databases, emphasizing the importance of using UTC for storing historical timestamps. It discusses avoiding unnecessary complexity, ensuring unambiguous time representations, using appropriate data types, understanding time zone relationships, and leveraging system-provided functions for time conversions. The guide underscores the need for careful data transformation and thorough testing to avoid errors in time-based analytics.
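
As a rough illustration of the "store UTC, keep representations unambiguous" advice, here is a minimal Python sketch. The function name `normalize_to_utc` is hypothetical and not taken from the post. It assumes input timestamps arrive as ISO 8601 strings, relies on the standard library for the conversion, and rejects values with no UTC offset rather than guessing one.

```python
from datetime import datetime, timezone


def normalize_to_utc(text):
    """Parse an ISO 8601 timestamp and return it as an aware UTC datetime.

    Hypothetical helper: timestamps without an explicit offset are
    ambiguous, so they are rejected instead of being assumed local or UTC.
    """
    dt = datetime.fromisoformat(text)
    if dt.tzinfo is None:
        raise ValueError("timestamp has no UTC offset; refusing to guess")
    # Let the standard library do the offset arithmetic.
    return dt.astimezone(timezone.utc)


# 09:30 at UTC+2 is the same instant as 07:30 UTC.
stored = normalize_to_utc("2024-07-01T09:30:00+02:00")
print(stored.isoformat())
```

Conversion back to a viewer's time zone would then happen only at display time, with the stored value left in UTC.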

  6. Article
    System Design Codex · 2y

    Kafka Load Balancing at Agoda for Terabytes of Data

    Agoda uses Kafka to manage hundreds of terabytes of data across various supply systems, including hotels and restaurants, ensuring real-time price updates. They faced challenges with the traditional round-robin partitioning and consumer assignment due to heterogeneous hardware and uneven workloads, resulting in over-provisioning. Agoda addressed these issues by implementing dynamic, lag-aware strategies, including a lag-aware producer and consumer, to optimize message distribution and reduce latency.
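
To make the contrast with round-robin concrete, here is a toy Python sketch of lag-aware partition selection. It is an illustration only, not Agoda's implementation: the names `pick_partition` and `assign` are hypothetical, and the lag table is assumed to come from some external monitoring source.

```python
def pick_partition(lag_by_partition):
    """Return the partition id with the smallest lag; ties go to the lowest id."""
    return min(sorted(lag_by_partition), key=lambda p: lag_by_partition[p])


def assign(messages, lag_by_partition):
    """Assign each message to the currently least-lagged partition.

    The lag table is copied and incremented per assignment, so a burst of
    messages spreads out instead of piling onto a single partition.
    """
    lag = dict(lag_by_partition)
    placement = []
    for message in messages:
        partition = pick_partition(lag)
        placement.append((message, partition))
        lag[partition] += 1
    return placement


# Assumed snapshot of consumer lag per partition (messages not yet processed).
observed_lag = {0: 120, 1: 3, 2: 5}
placement = assign(["m1", "m2", "m3", "m4"], observed_lag)
print(placement)
```

Round-robin would send a share of these messages to partition 0 regardless of its backlog; the sketch skips it until the other partitions catch up to its lag.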

  7. Article
    Hacker News · 2y

    IronCalc

    Spreadsheets have been vital for decades, yet finding a universally accessible and high-quality engine remains difficult. IronCalc aims to provide an open-source spreadsheet engine to assist SaaS developers, enable automated spreadsheet processing, support global collaboration, and allow bloggers to embed interactive spreadsheets. Beyond code, IronCalc focuses on advancing spreadsheet technology through research, community collaboration, and building a knowledge base for future developers.

  8. Article
    Lobsters · 2y

    CSVs Are Kinda Bad. DSVs Are Kinda Good.

    CSVs often pose challenges with different delimiters, escape characters, and newline conventions, leading to malformed data and parsing issues. Using ASCII control characters as delimiters, like unit and record separators, can simplify data parsing by avoiding conflicts with printable characters. However, there is limited tool support for these delimiters compared to CSVs, which are widely supported despite their fragility.
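
A minimal Python sketch of the delimiter idea, assuming fields are plain strings: it uses the ASCII unit separator (0x1F) between fields and the record separator (0x1E) between records. The helper names `dumps` and `loads` are hypothetical and not from the article.

```python
# ASCII control characters reserved for delimiting data.
UNIT_SEP = "\x1f"    # separates fields within a record
RECORD_SEP = "\x1e"  # separates records


def dumps(rows):
    """Serialize rows of strings; reject values that contain a separator."""
    for row in rows:
        for value in row:
            if UNIT_SEP in value or RECORD_SEP in value:
                raise ValueError("field contains a separator character")
    return RECORD_SEP.join(UNIT_SEP.join(row) for row in rows)


def loads(text):
    """Parse text produced by dumps back into a list of rows."""
    if text == "":
        return []
    return [record.split(UNIT_SEP) for record in text.split(RECORD_SEP)]


# Commas, quotes, and newlines need no escaping because none is a delimiter.
rows = [["name", "note"], ["Ada", 'says "hi", then\nleaves']]
encoded = dumps(rows)
print(loads(encoded) == rows)
```

No quoting or escaping rules are needed here, which is the appeal. The trade-off noted above still applies: editors and spreadsheet tools generally will not display or produce these characters.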

  9. Article
    Machine Learning News · 2y

    OmniParse: An AI Platform that Ingests/Parses Any Unstructured Data into Structured, Actionable Data Optimized for GenAI (LLM) Applications

    OmniParse is an AI platform designed to convert various unstructured data types, including documents, images, audio, video, and web content, into structured, actionable data. It supports around 20 different file types and operates entirely locally, ensuring data privacy. OmniParse deploys using Docker and SkyPilot and works with platforms like Colab. It uses models such as Surya OCR and Whisper to convert data accurately and efficiently for Generative AI applications.

  10. Article
    Medium · 2y

    High-Performance Python Data Processing: pandas 2 vs. Polars, a vCPU Perspective

    Polars is emerging as a strong competitor to pandas for Python data analysis, boasting significant performance improvements due to its Rust backend optimized for parallel processing and vectorized operations. This post tests Polars against pandas across varying vCPU counts, finding Polars generally faster, though it runs into some challenges on single-vCPU setups. While Polars shows great promise, considerations like cost, compatibility, and maturity remain important when evaluating a switch from pandas.

  11. Article
    Materialized View · 1y

    S3 Is the New SFTP

    Fintech companies handle diverse data processing tasks, including shuffling files between vendors and partners, often using SFTP. Transitioning to modern data lakehouses using S3, Apache Iceberg, and Apache Parquet can centralize and streamline this process. This new method allows ease of access and management while maintaining advantages such as fast transfers and central access control. Although challenges like schema evolution remain, adopting data lakehouses can benefit companies seeking efficient and scalable data solutions. The trend is supported by rising customer demand and the involvement of startups providing innovative data export platforms.

  12. Article
    Community Picks · 2y

    Building an Advanced RAG System With Self-Querying Retrieval

    Learn how to build an advanced Retrieval Augmented Generation (RAG) system that leverages self-querying retrieval to improve search relevance. This tutorial covers extracting metadata filters from natural language queries, combining metadata filtering with vector search, and generating structured outputs using LLMs. The guide focuses on developing an investment assistant to answer financial questions using MongoDB as the vector store and LangGraph for orchestration.

  13. Article
    Javarevisited · 2y

    The 2024 Data Scientist RoadMap

    An illustrated guide to becoming a Data Scientist in 2024 with links to relevant courses

  14. Article
    Machine Learning Mastery · 1y

    Building a Graph RAG System: A Step-by-Step Approach

    Graph RAG is gaining popularity for its ability to organize retrieved data as a graph, connecting documents through nodes and edges to provide comprehensive and insightful responses. This method addresses the limitations of traditional Retrieval-Augmented Generation (RAG) systems, which often fail to connect fragmented information across multiple documents. The post details the step-by-step implementation of Graph RAG using LlamaIndex, including key processes like breaking down documents into text chunks, identifying nodes and edges, summarizing elements, and building communities for more effective data reasoning and responses.

  15. Article
    Community Picks · 2y

    Kafka Migration and Event Streaming

    Apache Kafka is an open-source distributed event and stream-processing platform known for its scalability and high throughput. This tutorial guides you through expanding a Kafka cluster by adding a new node and migrating topic partitions for optimal resource utilization using both manual scripts and Kafka Cruise Control. It also covers aggregating event data with ksqlDB, a database that operates on top of Kafka topics using SQL-like syntax. The tutorial includes Docker Compose configurations, command line instructions, and detailed steps for setting up and verifying the expanded Kafka cluster and ksqlDB integration.

  16. Article
    NVIDIA Developer · 2y

    Mastering LLM Techniques: Data Preprocessing

    Large language models (LLMs) significantly enhance efficiency by automating tasks, but their performance depends heavily on high-quality data. Effective data preprocessing, such as text cleaning, deduplication, and quality filtering, is crucial for model accuracy. Techniques like synthetic data generation and tools like NVIDIA NeMo Curator can help with common challenges such as data scarcity, toxic content, and the efficient management of vast datasets. NeMo Curator's use of GPU-accelerated libraries speeds up data processing workflows.

  17. Article
    Towards Dev · 2y

    What Is a Streaming Database?

    A streaming database is designed to process large amounts of real-time streaming data, providing real-time insights and analysis. It is ideal for latency-critical applications such as real-time analytics, fraud detection, network monitoring, and the Internet of Things (IoT). Streaming databases differ from traditional databases in their processing approach and can be used alongside other data systems for streaming ingestion and streaming analytics. They also differ from OLTP and OLAP databases in terms of ACID compliance, data correctness, and query optimization.

  18. Article
    ByteByteGo · 1y

    How Statsig Streams 1 Trillion Events A Day

    Statsig processes over a trillion events daily for high-profile clients such as OpenAI and Atlassian, with a robust data pipeline designed for scalability and cost-efficiency. Key components include a reliable data ingestion layer, scalable message queues, and effective routing and integration techniques. Their strategy involves using Google Cloud Storage, Pub/Sub, spot nodes, and advanced compression methods to optimize performance and minimize costs, ensuring high reliability and low latency.

  19. Article
    MDN Blog · 2y

    Efficient data handling with the Streams API

    The Streams API allows efficient data handling in JavaScript by enabling processing of data as it arrives, making it suitable for continuous data sources and real-time applications. Key concepts include chunks, backpressure, and piping, and the API includes abstractions like ReadableStream, WritableStream, and TransformStream. The post provides a practical example of building a Node.js application to transform data streams and explores various real-world use cases such as video streaming, data visualization, and file transfer systems.

  20. Article
    SQL Shack · 2y

    Finding Duplicates in SQL

    This post explains the different ways to find duplicate values in SQL using DISTINCT and COUNT, GROUP BY and COUNT, and ROW_NUMBER functions. It provides examples and guidance on how to use these functions to identify duplicates in single columns or across multiple columns. The post also highlights the importance of managing duplicates in data storage and processing.
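
Two of the approaches named above can be tried with Python's built-in `sqlite3` module. This is a sketch under assumptions: the `customers` table and its rows are invented for illustration, and `ROW_NUMBER` requires a SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE customers (id INTEGER PRIMARY KEY, email TEXT, city TEXT)"
)
# Illustrative data only: one email appears three times, twice with the same city.
conn.executemany(
    "INSERT INTO customers (email, city) VALUES (?, ?)",
    [
        ("a@example.com", "Oslo"),
        ("b@example.com", "Lima"),
        ("a@example.com", "Oslo"),
        ("a@example.com", "Rome"),
    ],
)

# GROUP BY + COUNT: values that occur more than once in a single column.
dup_emails = conn.execute(
    "SELECT email, COUNT(*) FROM customers GROUP BY email HAVING COUNT(*) > 1"
).fetchall()

# ROW_NUMBER: every repeat of an (email, city) pair after its first occurrence.
dup_rows = conn.execute(
    """
    SELECT id, email, city FROM (
        SELECT id, email, city,
               ROW_NUMBER() OVER (PARTITION BY email, city ORDER BY id) AS rn
        FROM customers
    ) AS ranked
    WHERE rn > 1
    ORDER BY id
    """
).fetchall()

print(dup_emails)
print(dup_rows)
```

The `GROUP BY` form reports which values repeat and how often. The `ROW_NUMBER` form returns the specific surplus rows, which is the more convenient shape when the goal is to delete duplicates while keeping one copy.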

  21. Article
    Medium · 2y

    My First Billion (of Rows) in DuckDB

    The post describes the author's experience with DuckDB, a database for processing large volumes of data locally. It covers the problem of processing logs of Brazilian electronic ballot boxes and the challenges involved. The post explains the features and advantages of DuckDB and provides a step-by-step implementation of data processing. It concludes with the author's evaluation of DuckDB's performance and usability.

  22. Article
    Community Picks · 2y

    AI engineering requires no academia or ML – just problem-solving

    AI engineering doesn't require an academic background or machine learning expertise. Tejas Kumar, an AI DevRel Engineer at DataStax, emphasizes that it involves applying AI to solve problems, often through AI API requests. Key techniques include fine-tuning, transfer learning, and optimizing model architecture to reduce costs. To mitigate AI hallucinations, Kumar recommends Retrieval-Augmented Generation (RAG), and to ensure privacy, running models locally using tools like LLVM or llama.cpp. More insights will be shared at the Shift Conference in Zadar.

  23. Article
    The New Stack · 2y

    Boost LLM Results: When to Use Knowledge Graph RAG

    Retrieval-augmented generation (RAG) systems sometimes fail to go deep enough into document sets, leading to shallow or incorrect responses. Using knowledge graphs can enhance RAG systems by connecting related documents more effectively. This method is especially useful for legal documents, technical documentation, research publications, and interconnected websites. Knowledge graphs use well-defined connections like HTML links, specialized keywords, and document structures to improve information retrieval and accuracy.

  24. Article
    Medium · 2y

    How We Solve Load Balancing Challenges in Apache Kafka

    This post discusses the challenges of load balancing in Apache Kafka and presents solutions, such as lag-aware producers and consumers, to address these challenges.

  25. Article
    Community Picks · 1y

    bodo-ai/Bodo: High-Performance Python Compute Engine for Data and AI

    Bodo is a compute engine that boosts the performance of Python programs for data processing and AI applications. It uses an auto-parallelizing JIT compiler to convert Python code into optimized, parallel binaries, and the project claims speedups of 20x to 240x over traditional frameworks. Bodo supports native Python APIs like pandas and NumPy, integrates with modern data infrastructure such as Apache Iceberg and Snowflake, and is compatible with the existing Python ecosystem. It is designed for data-intensive workloads and can be installed via pip or conda.