Hugging Face has introduced Cosmopedia, the largest open synthetic dataset to date, with over 25 billion tokens and 30 million files. It aims to provide broad, high-quality synthetic data by drawing on a combination of curated sources and web data. The team used several techniques to keep the generation prompts diverse and to improve how well they perform.