Text degeneration, in which an autoregressive language model enters a repetition loop and never emits an EOS token, is a structural failure mode rooted in the maximum-likelihood training objective. In production OCR workloads, degenerate requests made up fewer than 3% of requests yet inflated total batch wall-clock time by over 42%, because looping sequences monopolize GPU memory and reduce parallelism for all co-running requests. Standard benchmarks track output quality but ignore degeneration rate entirely, so two models with near-identical quality scores can differ by an order of magnitude in production cost. The authors propose tracking degeneration rate as a first-class metric alongside latency and throughput. Inference-layer mitigations (repetition penalties, early abort, retries) help but are partial and carry their own overhead. A two-stage training pipeline (SFT followed by DPO, using degenerate outputs as rejected examples) reduced degeneration rates by 37–88% across five model families, with smaller specialized models outperforming larger general-purpose ones on stability.
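To make the proposed metric concrete, here is a minimal sketch of how a degeneration rate could be computed from logged generations. It is an illustration under stated assumptions, not the authors' method: the summary does not say how degenerate outputs were detected. The names `Generation`, `has_repetition_loop`, and `degeneration_rate`, and the thresholds `max_period` and `min_repeats`, are all hypothetical. A request is counted as degenerate when it never emitted EOS and its output ends in a short block repeated several times.

```python
from dataclasses import dataclass
from typing import List, Sequence


@dataclass
class Generation:
    """One logged request (hypothetical record shape)."""
    tokens: List[int]  # generated token ids
    finished: bool     # True if the model emitted EOS before the length cap


def has_repetition_loop(tokens: Sequence[int],
                        max_period: int = 32,
                        min_repeats: int = 4) -> bool:
    """Return True if the tail of `tokens` is a single block of length
    1..max_period repeated at least `min_repeats` times back to back.
    Both thresholds are assumed values, not taken from the source."""
    n = len(tokens)
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if span > n:
            break  # longer periods cannot fit either
        tail = tokens[n - span:]
        block = tail[:period]
        if all(tail[i] == block[i % period] for i in range(span)):
            return True
    return False


def degeneration_rate(generations: Sequence[Generation]) -> float:
    """Fraction of requests that hit the length cap while looping."""
    if not generations:
        return 0.0
    degenerate = sum(
        1 for g in generations
        if not g.finished and has_repetition_loop(g.tokens)
    )
    return degenerate / len(generations)


if __name__ == "__main__":
    batch = [
        Generation([1, 2, 3, 4, 5], finished=True),
        Generation([9, 7, 8, 7, 8, 7, 8, 7, 8], finished=False),  # loops on (7, 8)
        Generation([1, 2, 3, 4, 5, 6, 7, 8], finished=False),     # truncated, no loop
    ]
    print(f"degeneration rate: {degeneration_rate(batch):.3f}")
```

Reporting this number next to latency and throughput is the kind of tracking the summary argues for. The same tail check could in principle drive an early-abort rule at inference time, though any such detector would need tuning against real outputs to avoid flagging legitimately repetitive content such as tables.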
Table of contents
The Anomaly in the Inference Log
Why Degeneration Is Structural, Not Configurable
The Cost Multiplier Hiding in Plain Sight
The Benchmark Blind Spot
Why Mitigation Is Itself a Tax
The Specialization–Stability Link
Reframing Evaluation and Observability
What Changes When You Start Measuring This