Learn how to build a custom data curation pipeline with NVIDIA NeMo Curator. The tutorial walks through downloading the TinyStories dataset and curating it with text cleaning, dataset filtering, deduplication, and PII redaction. You can tailor the pipeline to your project's needs while maintaining data quality and privacy.

11m read time
From developer.nvidia.com
Table of contents
Overview
Prerequisite
Defining custom document builders
Downloading the TinyStories dataset
Text cleaning and unification
Dataset filtering
Deduplication
PII redaction
Putting the curation pipeline together
Next steps
