Text degeneration, where autoregressive language models enter repetition loops and never emit an EOS token, is a structural failure mode rooted in the maximum-likelihood training objective. In production OCR workloads, degenerate requests made up fewer than 3% of traffic yet inflated total batch wall-clock time by over 42%, because looping sequences monopolize GPU memory and reduce parallelism for all co-running requests. Standard benchmarks track output quality but ignore degeneration rate entirely, so two models with near-identical quality scores can differ by an order of magnitude in production cost. The authors propose tracking degeneration rate as a first-class metric alongside latency and throughput. Inference-layer mitigations (repetition penalties, early abort, retries) help but are only partial fixes and carry their own overhead. A two-stage training pipeline, SFT followed by DPO with degenerate outputs as the rejected examples, reduced degeneration rates by 37–88% across five model families, with smaller specialized models proving more stable than larger general-purpose ones.
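To make "degeneration rate as a first-class metric" concrete, here is a minimal sketch of how such a rate could be computed from logged generations. It is not the authors' implementation. It assumes a request counts as degenerate when it never emitted EOS and its output ends in a short block repeated several times. The `Generation` record, the function names, and the `max_period` and `min_repeats` thresholds are all hypothetical choices for illustration.

```python
from dataclasses import dataclass


@dataclass
class Generation:
    """Hypothetical log record for one finished request."""
    tokens: list[int]  # generated token ids
    hit_eos: bool      # True if the model emitted EOS before the length cap


def has_repetition_loop(tokens, max_period=32, min_repeats=4):
    """Return True if the sequence ends in a block of at most
    `max_period` tokens repeated at least `min_repeats` times in a row.
    Both thresholds are illustrative assumptions, not tuned values."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break  # longer periods cannot fit either
        tail = tokens[n - span:]
        block = tail[:period]
        if all(tail[i] == block[i % period] for i in range(span)):
            return True
    return False


def is_degenerate(gen):
    """Assumed definition: no EOS and a trailing repetition loop."""
    return (not gen.hit_eos) and has_repetition_loop(gen.tokens)


def degeneration_rate(gens):
    """Fraction of requests flagged as degenerate (0.0 for an empty batch)."""
    if not gens:
        return 0.0
    return sum(is_degenerate(g) for g in gens) / len(gens)
```

A rate computed this way can be logged per model and per batch next to latency and throughput. The same check could also drive an early abort during decoding, although that use brings the overhead trade-off the summary mentions.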

15m read time. From huggingface.co
Table of contents
The Anomaly in the Inference Log
Why Degeneration Is Structural, Not Configurable
The Cost Multiplier Hiding in Plain Sight
The Benchmark Blind Spot
Why Mitigation Is Itself a Tax
The Specialization–Stability Link
Reframing Evaluation and Observability
What Changes When You Start Measuring This
Sources
