Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling

Runpeng Dai, Tong Zheng, Rui Liu, Chengsong Huang, Hongtu Zhu · 2026-06-01, 23:42 UTC

Test-time scaling improves the reasoning performance of large language models but incurs substantial cost in both total computation and latency. Existing adaptive sampling methods partially mitigate this issue by dynamically deciding when to stop sampling, yet they typically rely on heuristic rules or distributional assumptions. In this work, we formulate adaptive sampling as a Markov decision process (MDP). We train a lightweight sampling controller with reinforcement learning (RL) to joi
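To make the setting concrete, here is a minimal sketch of adaptive sampling viewed as a sequential decision problem: the state is the set of answers drawn so far, and the action is whether to draw another sample or stop. This is not the authors' method. The names `adaptive_sample`, `margin_policy`, `Policy`, and `sample_fn` are hypothetical, and the stopping rule shown is the kind of hand-written heuristic the abstract contrasts with a learned controller. In an RL setup, a trained policy would presumably take the place of `margin_policy`.

```python
from collections import Counter
from typing import Callable, Hashable, Optional, Tuple

# A policy maps the current state (vote counts, samples drawn) to an action:
# True = draw another sample, False = stop and answer. (Hypothetical interface.)
Policy = Callable[[Counter, int], bool]


def margin_policy(votes: Counter, n_samples: int, margin: int = 2) -> bool:
    """Heuristic stopping rule (an assumption, not from the paper):
    keep sampling until the top answer leads the runner-up by `margin` votes.
    `n_samples` is part of the state but unused by this simple rule."""
    ranked = votes.most_common(2)
    if not ranked:
        return True  # nothing sampled yet, so continue
    runner_up = ranked[1][1] if len(ranked) > 1 else 0
    return ranked[0][1] - runner_up < margin


def adaptive_sample(
    sample_fn: Callable[[], Hashable],
    policy: Policy,
    max_samples: int = 16,
) -> Tuple[Optional[Hashable], int]:
    """Draw answers from `sample_fn` until the policy stops or the budget is hit.
    Returns the majority answer and the number of samples used."""
    votes: Counter = Counter()
    n = 0
    while n < max_samples and policy(votes, n):
        votes[sample_fn()] += 1
        n += 1
    answer = votes.most_common(1)[0][0] if votes else None
    return answer, n


if __name__ == "__main__":
    # A stand-in for an LLM call that always returns the same answer.
    print(adaptive_sample(lambda: "42", margin_policy))  # ('42', 2)
```

With a sampler that always agrees with itself, the rule stops after two samples. With a sampler whose answers keep splitting evenly, it runs until `max_samples`, which illustrates the compute and latency trade-off that a stopping decision controls.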

Original source

Small RL Controller, Large Language Model: RL-Guided Adaptive Sampling for Test-Time Scaling
