IBM's Data Prep Kit is an open-source toolkit for preparing data for generative AI applications, including fine-tuning and retrieval-augmented generation (RAG). It helps AI developers cleanse, transform, and enrich unstructured data on Python, Ray, or Spark runtimes. The kit handles both natural language and code data, and it scales from a local machine to a data center. It includes a set of transforms and example notebooks that guide users through data conversion, de-duplication, PII identification, and more.