Text Degeneration: A Production Failure Mode That Most Benchmarks Do Not Track

Text degeneration, where autoregressive language models enter repetition loops and never emit an EOS token, is a structural failure mode rooted in the maximum-likelihood training objective. In production OCR workloads, fewer than 3% of degenerate requests inflated total batch wall-clock time by over 42%, because looping sequences monopolize GPU memory and reduce parallelism for all co-running requests. Standard benchmarks track output quality but ignore degeneration rate entirely, so two models with near-identical quality scores can differ by an order of magnitude in production cost. The authors propose tracking degeneration rate as a first-class metric alongside latency and throughput. Inference-layer mitigations (repetition penalties, early abort, retries) help, but they are partial and carry their own overhead. A two-stage training pipeline (SFT followed by DPO, with degenerate outputs as the rejected examples) reduced degeneration rates by 37–88% across five model families, with smaller specialized models outperforming larger general-purpose ones on stability.
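As a rough illustration of tracking degeneration rate as a metric, the sketch below flags a request as degenerate when it ran to the token cap without emitting EOS and also contains a heavily repeated n-gram. This working definition, the function names, and the thresholds (`ngram=4`, `min_repeats=8`) are assumptions chosen for illustration; the summary does not say how the authors detect degeneration.

```python
from collections import Counter


def has_repetition_loop(token_ids, ngram=4, min_repeats=8):
    """Return True if any n-gram occurs at least `min_repeats` times.

    Thresholds are illustrative assumptions, not values from the article.
    """
    if len(token_ids) < ngram:
        return False
    counts = Counter(
        tuple(token_ids[i:i + ngram])
        for i in range(len(token_ids) - ngram + 1)
    )
    return max(counts.values()) >= min_repeats


def is_degenerate(token_ids, hit_max_tokens, ngram=4, min_repeats=8):
    """Assumed definition: no EOS (ran to the token cap) AND a repetition loop."""
    return hit_max_tokens and has_repetition_loop(token_ids, ngram, min_repeats)


def degeneration_rate(requests):
    """Fraction of requests flagged as degenerate.

    `requests` is an iterable of (token_ids, hit_max_tokens) pairs.
    """
    requests = list(requests)
    if not requests:
        return 0.0
    flagged = sum(is_degenerate(tokens, capped) for tokens, capped in requests)
    return flagged / len(requests)


# Hypothetical toy data: one looping sequence and one healthy sequence.
looping = [1, 2, 3, 4] * 10
healthy = list(range(40))
batch = [(looping, True), (healthy, False), (healthy, True), (looping, False)]
rate = degeneration_rate(batch)
```

Requiring both conditions keeps long but legitimate outputs, and short outputs that happen to repeat a phrase, out of the count. A production monitor would probably log this per model and per workload next to latency and throughput.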

May 22•15m read time•From huggingface.co

Table of contents

The Anomaly in the Inference Log
Why Degeneration Is Structural, Not Configurable
The Cost Multiplier Hiding in Plain Sight
The Benchmark Blind Spot
Why Mitigation Is Itself a Tax
The Specialization–Stability Link
Reframing Evaluation and Observability
What Changes When You Start Measuring This
Sources
