
Production LLM agent traces contain rich domain-specific data, but they need a pipeline before they become usable training data. This walkthrough shows how to use dlt to extract and normalize traces from any source (Postgres, S3, BigQuery, REST APIs), land them as versioned Parquet datasets on Hugging Face, and then use Distil Labs to generate synthetic training data and fine-tune a compact specialist model. In the results, a fine-tuned Qwen3-0.6B student model reaches 78.3% tool-call equivalence, beating the 120B teacher model's 50.6%, while being 200x smaller with sub-50ms local inference. The post also outlines a path to a continuous fine-tuning loop in which incremental trace loads automatically trigger retraining as production traffic evolves.
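To make the "normalize traces" step concrete, here is a minimal stdlib sketch of the kind of parent/child flattening that a tool like dlt performs on nested trace records before writing them out as tabular Parquet. The trace shape and field names below are hypothetical, not taken from the post:

```python
import json

# Hypothetical example of one production agent trace with nested tool calls.
trace = {
    "trace_id": "t-001",
    "model": "gpt-4",
    "tool_calls": [
        {"name": "search", "arguments": {"query": "weather"}},
        {"name": "calculator", "arguments": {"expr": "2+2"}},
    ],
}

def flatten_trace(trace):
    """Split one nested trace into a parent row plus child rows,
    mimicking the parent/child table normalization dlt applies
    to nested JSON before loading it to a destination."""
    parent = {k: v for k, v in trace.items() if not isinstance(v, list)}
    children = [
        {
            "trace_id": trace["trace_id"],  # foreign key back to the parent row
            "name": call["name"],
            "arguments": json.dumps(call["arguments"]),  # serialize nested args
        }
        for call in trace.get("tool_calls", [])
    ]
    return parent, children

parent, children = flatten_trace(trace)
```

After flattening, `parent` holds the scalar trace fields and `children` holds one row per tool call, which is the tabular shape a Parquet dataset (and a downstream fine-tuning job) expects.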

8 min read · From dlthub.com
Table of contents
- The two problems that block most fine-tuning projects
- How the pipeline works: dlt → Hugging Face → Distil Labs
- The dlt pipeline in detail
- From traces to training data to deployed model
- Results
- What comes next
- Try it yourself
