The article examines the "alignment tax": the performance tradeoff incurred when aligning large language models (LLMs) to human preferences via methods such as Reinforcement Learning from Human Feedback (RLHF) and Direct Preference Optimization (DPO). It argues that current alignment techniques prioritize agreeableness over truthful reasoning, causing models to hedge, refuse, or deflect rather than think critically. The piece contrasts DPO with RLHF as competing approaches to this fundamental tension.
