
Robert Youssef @rryssf_
Holy shit... RLHF alignment is quietly breaking the most important safety feature AI systems have: knowing when they don't know something.

On 40–79% of factual questions, aligned models generate the exact same answer 10 times in a row. Not similar answers. The same answer. The model has been trained so hard to be consistent that it's lost the ability to express uncertainty through sampling, which is how every major uncertainty detection system works.

The setup: generate 10 independent samples for the same question at temperature 1.0 and count how many distinct semantic clusters emerge (sketch below). A model with genuine uncertainty produces diverse answers. A model that has been RLHF-aligned produces one. On TruthfulQA (790 questions), Qwen3-14B collapses to a single semantic cluster on 28.5% of questions under Jaccard clustering and 79% under embedding-based clustering. The base model (same architecture, no alignment): a 1.0% single-cluster rate.

The causal mechanism is isolated. Base model: 1.0% single-cluster rate. After SFT: 1.5%. After DPO: 4.0% on Zephyr, 28.5% on Qwen3-14B. SFT barely touches diversity. DPO is the driver. The preference optimization stage, the very one that makes models helpful, harmless, and honest, is the exact stage that breaks uncertainty estimation. On single-cluster questions, every sampling-based uncertainty method scores AUROC = 0.500. Literally random.

→ Qwen3-14B base: 1.0% single-cluster rate
→ Qwen3-14B instruct: 28.5% (p < 10⁻⁶)
→ Sampling-based uncertainty on single-cluster questions: AUROC = 0.500, random guessing
→ Free token entropy on the same questions: AUROC = 0.603, still works
→ NLI-based semantic entropy with a 435M DeBERTa model: AUROC = 0.511, still random
→ Scaling the NLI model 6.2× (70M → 435M): zero improvement; the bottleneck is the model, not the detector
→ Alignment tax varies 50× across families: Qwen3-14B 28.5% vs. Tulu-3 0.5%
→ GSM8K (math): token entropy AUROC = 0.724, Cohen's d = 0.81; alignment doesn't suppress math uncertainty the same way
→ Selective prediction on GSM8K: accuracy jumps from 84.4% to 93.2% at 50% coverage

Why token entropy still works: RLHF suppresses inter-response diversity but can't fully smooth per-token computational uncertainty without degrading generation quality. The model generates the same answer every time, but its internal confidence over each next token is still variable and still predictive. Sampling-based methods measure diversity between outputs. Token entropy measures uncertainty inside a single forward pass (sketch below). Alignment kills the first. It can't fully kill the second.

The practical implication lands hard. Every major AI safety deployment uses sampling-based uncertainty to decide when to abstain, escalate, or flag. SelfCheckGPT, Semantic Entropy, SINdex: all of them rely on response diversity. On 40–79% of factual questions from aligned models, that diversity doesn't exist. The uncertainty signal is structurally zero. The model reports high confidence. It might be wrong.

The paper calls this the "alignment tax." You pay it every time DPO makes a model more consistent, because consistency and calibrated uncertainty are in direct tension. The models that feel most reliable are the ones most likely to be silently wrong.
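Here's roughly what the single-cluster measurement looks like in practice. A minimal sketch, assuming a HuggingFace-style model and tokenizer; the greedy Jaccard clustering and the 0.5 similarity threshold are illustrative assumptions, not the paper's exact settings.

```python
# Sketch: sample N answers per question at temperature 1.0, then count
# semantic clusters via greedy Jaccard clustering over word sets.
from transformers import AutoModelForCausalLM, AutoTokenizer

def jaccard(a: str, b: str) -> float:
    sa, sb = set(a.lower().split()), set(b.lower().split())
    return len(sa & sb) / max(len(sa | sb), 1)

def count_clusters(answers, threshold=0.5):
    """Greedy clustering: an answer joins a cluster if it overlaps its
    representative above the threshold; otherwise it starts a new one."""
    reps = []
    for ans in answers:
        if not any(jaccard(ans, r) >= threshold for r in reps):
            reps.append(ans)
    return len(reps)

def single_cluster_rate(model, tokenizer, questions, n_samples=10):
    """Fraction of questions where all sampled answers fall into one cluster."""
    collapsed = 0
    for q in questions:
        inputs = tokenizer(q, return_tensors="pt").to(model.device)
        out = model.generate(
            **inputs,
            do_sample=True,
            temperature=1.0,               # temperature 1.0, as in the setup above
            num_return_sequences=n_samples,
            max_new_tokens=64,
        )
        answers = [
            tokenizer.decode(seq[inputs["input_ids"].shape[1]:], skip_special_tokens=True)
            for seq in out
        ]
        if count_clusters(answers) == 1:
            collapsed += 1
    return collapsed / len(questions)

# Usage (model name illustrative):
# tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-14B")
# mdl = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-14B", device_map="auto")
# rate = single_cluster_rate(mdl, tok, truthfulqa_questions)
```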
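And the signal that survives: per-token entropy computed from a single generated answer, no extra sampling. Another sketch; the exact aggregation the paper uses (mean vs. length-normalized sum) is an assumption here.

```python
# Sketch: greedy-decode one answer and average the entropy of each
# next-token distribution along the way.
import torch.nn.functional as F

def mean_token_entropy(model, tokenizer, question, max_new_tokens=64):
    inputs = tokenizer(question, return_tensors="pt").to(model.device)
    out = model.generate(
        **inputs,
        do_sample=False,
        max_new_tokens=max_new_tokens,
        output_scores=True,
        return_dict_in_generate=True,
    )
    entropies = []
    for step_logits in out.scores:           # one logits tensor per generated token
        logp = F.log_softmax(step_logits, dim=-1)
        entropies.append(-(logp.exp() * logp).sum(dim=-1).item())
    return sum(entropies) / max(len(entropies), 1)
```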
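Selective prediction then just means answering the most confident fraction of questions and abstaining on the rest. A toy version, with the 50% coverage cut as an illustrative choice:

```python
import numpy as np

def selective_accuracy(entropies, correct, coverage=0.5):
    """Keep the `coverage` fraction of questions with lowest token entropy
    and report accuracy on that kept subset (abstain on the rest)."""
    entropies = np.asarray(entropies)
    correct = np.asarray(correct, dtype=float)
    k = int(len(entropies) * coverage)
    kept = np.argsort(entropies)[:k]          # most confident = lowest entropy
    return correct[kept].mean()
```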
