NVIDIA presents a concept-driven synthetic data generation workflow for LLM pretraining and uses it to produce Code Concepts, a dataset of 15 million Python programming problems. The approach builds a hierarchical taxonomy of programming concepts by annotating existing code datasets, then generates problems by combining selected concepts. The dataset targets 91 core concepts aligned with the HumanEval benchmark. It was validated by including 10 billion tokens of it in the final 100 billion tokens of Nemotron-Nano-v3 pretraining, which raised the HumanEval score by six points (73 to 79). Both the dataset and the taxonomy are released under CC-BY-4.0.
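
The concept-combination step can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the tiny `TAXONOMY`, the helper names `sample_concepts` and `build_prompt`, and the prompt wording are hypothetical and are not taken from NVIDIA's pipeline or its released taxonomy.

```python
import random

# Hypothetical miniature taxonomy (category -> leaf concepts).
# The real taxonomy is hierarchical and much larger; these entries are made up.
TAXONOMY = {
    "data structures": ["list indexing", "dictionary lookup", "set membership"],
    "control flow": ["for loop", "early return", "nested conditionals"],
    "strings": ["string slicing", "string formatting"],
}


def sample_concepts(taxonomy, k, rng):
    """Pick k leaf concepts, each drawn from a different category."""
    categories = rng.sample(sorted(taxonomy), k)
    return [rng.choice(taxonomy[category]) for category in categories]


def build_prompt(concepts):
    """Turn a concept combination into a generation prompt for an LLM."""
    bullet_list = "\n".join(f"- {concept}" for concept in concepts)
    return (
        "Write a self-contained Python programming problem that exercises "
        "all of the following concepts:\n" + bullet_list
    )


rng = random.Random(0)  # fixed seed so the sketch is reproducible
concepts = sample_concepts(TAXONOMY, 2, rng)
prompt = build_prompt(concepts)
print(prompt)
```

In this sketch, each sampled combination yields one prompt; sending many such prompts to a generator model would be one way to obtain a large set of concept-targeted problems.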

4m read time. From huggingface.co