STARE: Surprisal-Guided Token-Level Advantage Reweighting for Policy Entropy Stability

Haipeng Luo, Qingfeng Sun, Songli Wu, Can Xu, Wenfeng Deng · 2026-06-16, 20:00 UTC



Reinforcement Learning with Verifiable Rewards (RLVR) algorithms such as GRPO have emerged as the dominant post-training paradigm for complex reasoning in LLMs, yet they commonly suffer from policy entropy collapse during training. We conduct a first-order gradient analysis of token-level entropy dynamics under GRPO and identify a token-level credit assignment mismatch: the per-token entropy variation decomposes into the product of the trajectory-level advantage and an entropy sensitivity function over the nex
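To make the quantities named in the abstract concrete, here is a minimal sketch of the standard building blocks: group-normalized GRPO advantages, per-token policy entropy, and per-token surprisal. It is not the STARE reweighting rule, which the excerpt does not spell out. The function names, the use of the population standard deviation, and the `eps` constant are assumptions made for illustration only.

```python
import math

def grpo_advantages(rewards, eps=1e-8):
    """Group-normalized advantages, one per sampled response.

    Assumption: population standard deviation plus a small eps for stability.
    """
    mean = sum(rewards) / len(rewards)
    var = sum((r - mean) ** 2 for r in rewards) / len(rewards)
    std = math.sqrt(var)
    return [(r - mean) / (std + eps) for r in rewards]

def softmax(logits):
    """Numerically stable softmax over one next-token logit vector."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    z = sum(exps)
    return [e / z for e in exps]

def entropy(probs):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def surprisal(probs, token_id):
    """Surprisal (in nats) of the token that was actually sampled."""
    return -math.log(probs[token_id])

# Four sampled responses to one prompt, scored by a verifiable 0/1 reward.
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = grpo_advantages(rewards)

# In plain GRPO every token of a response shares that response's advantage,
# whether the token was a confident or an uncertain choice.
response_length = 3  # hypothetical length of response 0
token_advantages = [advantages[0]] * response_length

# Per-token statistics for a uniform distribution over a 4-token vocabulary.
probs = softmax([0.0, 0.0, 0.0, 0.0])
token_entropy = entropy(probs)
token_surprisal = surprisal(probs, 2)
```

Because the same trajectory-level advantage is broadcast to every token, tokens with very different entropy and surprisal receive identical credit, which is the setting in which the abstract's credit assignment mismatch arises.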

