Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Researchers propose a method for converting Autoregressive Language Models (ARLMs) into Diffusion Language Models (DLMs) that uses on-policy distillation to mitigate distribution shifts and preserve pre-trained knowledge.

Bridging the Gap Between AR and Diffusion Architectures

Converting an Autoregressive Language Model (ARLM) into a Diffusion Language Model (DLM) changes both the attention pattern and the generation procedure. ARLMs rely on causal attention for next-token prediction, whereas DLMs use bidirectional attention and generate text through a denoising process. Standard conversion methods replace the causal attention mask of a pre-trained ARLM with bidirectional attention and then retrain the model under a DLM objective.
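The difference between the two attention patterns can be illustrated with a minimal sketch. The function names causal_mask and bidirectional_mask are hypothetical and chosen for illustration only; real models apply such masks inside their attention layers, usually as tensors rather than nested lists.

```python
def causal_mask(n):
    """Return an n x n mask where mask[i][j] is True if position i
    may attend to position j. Causal attention allows only j <= i."""
    return [[j <= i for j in range(n)] for i in range(n)]


def bidirectional_mask(n):
    """Return an n x n mask in which every position may attend to
    every other position, as in bidirectional attention."""
    return [[True for _ in range(n)] for _ in range(n)]


if __name__ == "__main__":
    # Print both masks for a 4-token sequence (1 = may attend, 0 = blocked).
    for name, mask in (("causal", causal_mask(4)),
                       ("bidirectional", bidirectional_mask(4))):
        print(name)
        for row in mask:
            print(" ".join("1" if allowed else "0" for allowed in row))
```

Swapping the first mask for the second is the architectural part of the conversion; the model weights were trained under the causal pattern, which is why retraining is needed afterwards.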

Addressing Distribution Shifts in Model Transformation

The authors identify two critical distribution shifts that hinder the efficiency of standard conversion processes:

  • Objective Shift: Moving from a next-token prediction objective to a diffusion-based objective can lead to the loss of critical knowledge acquired by the ARLM during its initial pre-training phase.
  • Sampling Shift: Standard DLMs often suffer from a discrepancy between the training distribution and the distribution encountered during inference (sampling), which can degrade overall performance.
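To make the objective shift concrete, the sketch below builds one training example under each objective. It assumes a masked (absorbing-state) form of discrete diffusion, which is one common DLM formulation; the source does not say which diffusion objective the authors use. The names MASK, next_token_example, and masked_denoising_example are hypothetical.

```python
import random

MASK = "[MASK]"  # hypothetical mask token, used for illustration only


def next_token_example(tokens):
    """Next-token prediction: the target at each position is the
    token that follows it in the sequence."""
    return tokens[:-1], tokens[1:]


def masked_denoising_example(tokens, mask_ratio, rng):
    """Masked denoising (assumed DLM objective): each token is replaced
    by MASK with probability mask_ratio, and the model is asked to
    recover the original token at every masked position."""
    noisy = []
    targets = {}  # position -> original token
    for i, tok in enumerate(tokens):
        if rng.random() < mask_ratio:
            noisy.append(MASK)
            targets[i] = tok
        else:
            noisy.append(tok)
    return noisy, targets


if __name__ == "__main__":
    sentence = ["the", "cat", "sat", "on", "the", "mat"]
    print(next_token_example(sentence))
    print(masked_denoising_example(sentence, 0.5, random.Random(0)))
```

In this sketch the noisy input is always a corrupted copy of ground-truth text. At sampling time a DLM instead conditions on partially denoised sequences that contain its own earlier predictions, which is one way to read the sampling shift described above.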

On-Policy Distillation for Data Efficiency

To overcome these challenges, the research explores the use of on-policy distillation. By leveraging the pre-existing knowledge of the ARLM, the proposed approach aims to create a more data-efficient pipeline for transforming these models into DLMs, reducing the amount of retraining required while maintaining the model's linguistic capabilities.
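Since the source gives no implementation details, the following is only a generic sketch of on-policy distillation, not the authors' mechanism. It assumes that the student generates its own samples, that a teacher scores the states the student actually visits, and that the loss is the reverse KL divergence from student to teacher. The toy rollout is left to right for simplicity; how the idea is adapted to a denoising student is not described in the source. All names (on_policy_distill_loss, student_fn, teacher_fn) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Convert a list of logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def reverse_kl(student_probs, teacher_probs):
    """KL(student || teacher). Assumes teacher probabilities are
    strictly positive, which holds for softmax outputs."""
    return sum(p * math.log(p / q)
               for p, q in zip(student_probs, teacher_probs) if p > 0)


def on_policy_distill_loss(student_fn, teacher_fn, prompt, steps, rng):
    """Roll out the student for `steps` tokens and average the reverse KL
    between student and teacher on the states the student visits."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    seq = list(prompt)
    total = 0.0
    for _ in range(steps):
        s_probs = student_fn(seq)
        t_probs = teacher_fn(seq)
        total += reverse_kl(s_probs, t_probs)
        # On-policy: the next token is sampled from the student itself.
        next_tok = rng.choices(range(len(s_probs)), weights=s_probs)[0]
        seq.append(next_tok)
    return total / steps, seq


if __name__ == "__main__":
    # Toy models over a 3-token vocabulary that ignore their context.
    student = lambda seq: softmax([0.0, 0.0, 0.0])
    teacher = lambda seq: softmax([1.0, 0.0, 0.0])
    loss, rollout = on_policy_distill_loss(student, teacher, [0], 5,
                                           random.Random(0))
    print(loss, rollout)
```

Because the training signal is computed on the student's own samples rather than on a fixed dataset, the student is trained on the same distribution it produces at inference, which is the usual motivation for on-policy methods.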

Note: The provided source text is a partial abstract. Detailed experimental results and the specific implementation of the on-policy distillation mechanism are not available in the provided snippet.
