NVIDIA outlines its open data strategy for AI development, under which it has released over 2 petabytes of training data across 180+ datasets on Hugging Face. Key releases include the Physical AI Collection (500K+ robotics trajectories, 15 TB of multimodal data), Nemotron Personas (synthetic population-scale datasets for sovereign AI across multiple countries), La Proteina (455K synthetic protein structures for drug discovery), SPEED-Bench (a speculative decoding benchmark), Retrieval-Synthetic-NVDocs-v1 (RAG training data), and Nemotron-ClimbMix (a 400B-token pre-training dataset). The post also details the Nemotron pre- and post-training dataset stacks used to train frontier models, and describes NVIDIA's "extreme co-design" philosophy of releasing datasets alongside the methods used to build them, so that the community can iterate on both.

8m read time · From huggingface.co
Table of contents: AI-Data Bottlenecks; Real-World Open Datasets; Nemotron Training Datasets; Extreme Co-Design; Start Cooking in the Open Kitchen
