Holistic Data Scheduler for LLM Pre‑Training via Multi‑Objective Reinforcement Learning

The paper proposes a framework that applies multi‑objective reinforcement learning to schedule data mixtures dynamically during large language model (LLM) pre‑training, aiming to balance diversity, coverage, and efficiency.

Background

Large language models rely heavily on the composition of their training corpora. Traditional static data mixes often fail to adapt to the evolving learning dynamics of the model, leading to sub‑optimal convergence and resource utilization. Online Data Mixing (ODM) has emerged as a technique to adjust data mixtures on the fly, yet existing ODM approaches typically optimize a single objective, such as perplexity or loss reduction.

Proposed Approach

The paper introduces a holistic scheduler that employs multi‑objective reinforcement learning (MORL) to navigate the trade‑offs inherent in LLM pre‑training. The scheduler treats each data source as an action, with the reinforcement learner updating a policy that maximizes a weighted combination of objectives:

  • Diversity of source tokens
  • Coverage of domain‑specific terminology
  • Training efficiency (e.g., steps to convergence)

By framing data selection as a Markov Decision Process, the algorithm learns to adaptively re‑weight data sources in response to the model’s current state, thereby steering the training trajectory toward a more balanced representation space.
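
To make the idea of re-weighting data sources concrete, here is a minimal sketch, not the authors' implementation. It simplifies the MDP to a stateless gradient-bandit with a softmax policy, and it collapses the objectives into one scalar through a weighted sum. The class name MixtureScheduler, the source names, the learning rate, and the objective weights are all hypothetical choices made for illustration.

```python
import math
import random


class MixtureScheduler:
    """Softmax policy over data sources with a gradient-bandit update.

    Assumption: this is a stateless simplification of the MDP described
    in the text, not the paper's actual algorithm.
    """

    def __init__(self, sources, lr=0.1, seed=0):
        self.sources = list(sources)
        self.prefs = [0.0] * len(self.sources)  # one preference per source
        self.lr = lr
        self.baseline = 0.0  # running mean of past scalar rewards
        self.steps = 0
        self.rng = random.Random(seed)

    def probabilities(self):
        # Numerically stable softmax over the preferences.
        m = max(self.prefs)
        exps = [math.exp(p - m) for p in self.prefs]
        total = sum(exps)
        return [e / total for e in exps]

    def choose(self):
        # Sample the index of the data source for the next batch.
        probs = self.probabilities()
        return self.rng.choices(range(len(probs)), weights=probs, k=1)[0]

    def update(self, action, objectives, weights):
        # Scalarize the objective vector with user-defined weights.
        reward = sum(w * o for w, o in zip(weights, objectives))
        advantage = reward - self.baseline
        probs = self.probabilities()
        for i in range(len(self.prefs)):
            indicator = 1.0 if i == action else 0.0
            self.prefs[i] += self.lr * advantage * (indicator - probs[i])
        # Update the running-mean baseline after computing the advantage.
        self.steps += 1
        self.baseline += (reward - self.baseline) / self.steps
        return reward


scheduler = MixtureScheduler(["web", "code", "books"], lr=0.5)
source = scheduler.choose()
# Hypothetical objective values: (diversity, coverage, efficiency) in [0, 1].
scheduler.update(source, objectives=[0.8, 0.4, 0.6], weights=[0.5, 0.3, 0.2])
print(scheduler.probabilities())
```

A source that earns an above-baseline reward gains probability mass, while the others lose it proportionally. A fuller treatment would condition the policy on the model's state, as the text describes.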

Reward Design

Rewards are computed from a composite signal that aggregates:

  1. Reduction in validation perplexity
  2. Increase in vocabulary coverage metrics
  3. Decrease in wasted compute measured by plateau detection

These signals are normalized and weighted according to user‑defined priorities, allowing researchers to emphasize specific objectives.
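
The aggregation described above might look like the following sketch. Since the exact normalization and weighting are not disclosed, everything here is an assumption: the clip-and-scale normalization, the scaling bounds max_ppl_drop and max_cov_gain, the default weights, and the simple window-based plateau test are illustrative stand-ins.

```python
def normalize(value, low, high):
    """Clip and scale a raw signal into [0, 1]; bounds are user-supplied."""
    if high <= low:
        raise ValueError("high must be greater than low")
    return min(1.0, max(0.0, (value - low) / (high - low)))


def is_plateau(losses, window=5, tol=1e-3):
    """Return True if loss improved by less than `tol` over the last `window` steps."""
    if len(losses) < window:
        return False
    recent = losses[-window:]
    return (recent[0] - recent[-1]) < tol


def composite_reward(ppl_before, ppl_after, cov_before, cov_after,
                     recent_losses, weights=(0.5, 0.3, 0.2),
                     max_ppl_drop=1.0, max_cov_gain=0.01):
    """Weighted mean of three normalized signals (hypothetical scheme)."""
    ppl_term = normalize(ppl_before - ppl_after, 0.0, max_ppl_drop)
    cov_term = normalize(cov_after - cov_before, 0.0, max_cov_gain)
    # Compute counts as "not wasted" unless a plateau is detected.
    compute_term = 0.0 if is_plateau(recent_losses) else 1.0
    terms = (ppl_term, cov_term, compute_term)
    return sum(w * t for w, t in zip(weights, terms)) / sum(weights)


reward = composite_reward(
    ppl_before=20.0, ppl_after=19.5,
    cov_before=0.600, cov_after=0.605,
    recent_losses=[3.0, 2.9, 2.8, 2.7, 2.6],
)
print(round(reward, 3))
```

Dividing by the sum of the weights keeps the reward in [0, 1] regardless of how the priorities are scaled, which makes runs with different weightings easier to compare.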

Policy Architecture

The policy network is a lightweight transformer encoder that ingests embeddings of the current model checkpoint together with recent gradient statistics and outputs a probability distribution over data sources. An epsilon‑greedy strategy encourages exploration, so that rarely selected data sources are still sampled periodically.
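
The exploration step can be illustrated with a short sketch. The source does not say how epsilon-greedy is combined with the policy's output distribution, so this assumes the classic form: with probability epsilon pick a data source uniformly at random, otherwise pick the source the policy rates highest. The function name and the value of epsilon are hypothetical.

```python
import random


def epsilon_greedy(probs, epsilon=0.1, rng=random):
    """Pick a data-source index from policy probabilities.

    Assumption: greedy means argmax of the policy output; the paper's
    exact rule is not specified in the available description.
    """
    if not 0.0 <= epsilon <= 1.0:
        raise ValueError("epsilon must be in [0, 1]")
    if rng.random() < epsilon:
        # Explore: every source has an equal chance of being picked.
        return rng.randrange(len(probs))
    # Exploit: choose the source with the highest policy probability.
    return max(range(len(probs)), key=lambda i: probs[i])


rng = random.Random(0)
policy_output = [0.2, 0.7, 0.1]  # hypothetical distribution over three sources
picks = [epsilon_greedy(policy_output, epsilon=0.1, rng=rng) for _ in range(1000)]
print({i: picks.count(i) for i in range(3)})
```

With epsilon set to 0.1, roughly nine in ten picks go to the top-rated source and the remainder are spread uniformly, so even a source with low policy probability keeps being sampled.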

Experimental Findings

Preliminary experiments on standard benchmarks (e.g., WikiText‑103 and CC‑100) show:

  • Up to 12% reduction in steps to reach a target perplexity
  • Improved token coverage by 8–10% relative to static mixes
  • Robustness to noisy or low‑quality data sources

These results suggest that MORL‑based scheduling can yield both efficiency gains and richer representations.

Limitations & Future Work

The available description does not fully disclose the exact reward weighting scheme, the hyperparameter settings, or how the method scales to billion‑parameter models. Reproducing the reported gains will therefore require further clarification from the authors.

Conclusion

The integration of multi‑objective reinforcement learning into data scheduling represents a promising direction for LLM pre‑training, offering a principled way to balance competing objectives and improve training efficiency. Further empirical validation and open‑source implementation will be critical to assess its practical impact.
