Best of Data Processing, April 2025

  1. Article
    Salesforce Engineering · 1y

    How a New AI Architecture Processes 100 Million Rows in 5 Minutes

    Salesforce developed a new AI-driven architecture that processes over 100 million rows of advertising data in five minutes. The Marketing Intelligence product unifies ad data from numerous sources, automates campaign performance insights, and simplifies complex data processing. By integrating with Salesforce-native technologies such as Data Cloud, Agentforce, and Tableau, the system scales both metadata and data processing to large volumes while keeping latency low.

  2. Article
    InfoWorld · 1y

    MarkItDown: Microsoft’s open-source tool for Markdown conversion

    Microsoft has introduced MarkItDown, an open-source Python utility that converts various file formats into Markdown. The tool is designed to help with fine-tuning large language models (LLMs) and building retrieval-augmented generation (RAG) systems. MarkItDown preserves document structure, supports multimodal data such as images and audio files, and integrates with LLMs for enhanced functionality. Despite some limitations, it addresses key challenges in document processing and offers developers a modular, extensible architecture.

  3. Video
    ByteGrad · 1y

    NEW RAG-App Stack Beats Previous LLM-Stack (AI-Chatbots, OpenAI File Search, ScraperAPI)

    Learn how to enhance an AI model by integrating a chatbot with web scraping and data processing tools. The process involves using ScraperAPI to collect and clean website data, then leveraging OpenAI's file search and response generation capabilities. This approach ensures the chatbot can provide accurate information based on the content of the website, reducing manual intervention and improving response quality.

  4. Article
    Niklas Heidloff · 1y

    Unstructured Data Preparation for Generative AI

    IBM's Data Prep Kit is an open-source toolkit for generative AI data preparation, supporting tasks such as fine-tuning and retrieval-augmented generation (RAG). It helps AI developers cleanse, transform, and enrich unstructured data on Python, Ray, and Spark runtimes. The kit handles both natural-language and code data and scales from local machines to data centers. It includes various transforms and example notebooks that guide users through data conversion, de-duplication, PII identification, and more.