ExpRL: Leveraging Exploratory Reinforcement Learning for LLM Mid-Training
ExpRL applies exploratory reinforcement learning to LLM mid-training so that the model discovers essential reasoning primitives on its own, reducing the reliance on manually curated training traces for improving reasoning capabilities.
The Challenge of Sparse Rewards in LLM Reasoning
Reinforcement learning (RL) with sparse rewards has become a central technique for improving the reasoning capabilities of large language models (LLMs). Its effectiveness, however, depends heavily on the coverage of the base model, that is, on how often the model samples a successful reasoning path before RL begins. If successful paths are almost never sampled, nearly every rollout receives the same zero reward, and the sparse signal gives the optimizer nothing to follow.
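As a rough illustration of this dependence, the Python sketch below computes the chance that a batch of rollouts contains at least one success, and the per-sample weights a REINFORCE-style update with a mean-reward baseline would assign. The function names, the binary 0/1 reward, the assumption of independent samples, and the choice of baseline are all illustrative assumptions; none of them is taken from ExpRL.

```python
def coverage(p_success: float, k: int) -> float:
    """Probability that at least one of k independent samples succeeds.

    Assumes every sample succeeds independently with probability p_success.
    """
    return 1.0 - (1.0 - p_success) ** k


def mean_baseline_advantages(rewards: list[float]) -> list[float]:
    """Reward minus the batch mean.

    This is the per-sample weight in a REINFORCE-style update that uses the
    batch mean reward as its baseline (an illustrative choice).
    """
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


# Every rollout failed: all weights are zero, so the update carries no signal.
all_failed = mean_baseline_advantages([0.0] * 8)

# One success among eight rollouts: that sample gets a positive weight
# and the failed samples get a small negative weight.
one_success = mean_baseline_advantages([0.0] * 7 + [1.0])

# With a very low per-sample success rate, even 64 rollouts rarely
# contain a single success.
low_coverage = coverage(1e-4, 64)
```

Under these assumptions, a batch with no successes yields identical weights of zero for every sample, which is the failure mode described above: the optimizer has nothing to distinguish good rollouts from bad ones.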
Moving Beyond Manual Mid-Training
To mitigate this "cold start" problem, the current industry standard is a "mid-training" phase, in which models are trained on curated reasoning traces designed to instill primitive skills such as:
- Decomposition: Breaking complex problems into smaller, manageable sub-tasks.
- Verification: Implementing internal checks to validate intermediate steps.
- Self-Correction: Identifying and rectifying errors during the generation process.
While effective, this strategy requires human experts to specify by hand which primitives the model should learn, which creates a bottleneck in developing more general reasoning capabilities.
Introducing ExpRL: Exploratory RL
ExpRL proposes a shift from manually curated mid-training to an exploratory framework. The goal is for the model to discover these reasoning primitives autonomously through exploratory RL. The approach asks whether the model can identify the strategies it needs for problem solving without being constrained by predefined, human-curated traces.