Data-Efficient Autoregressive-to-Diffusion Language Models via On-Policy Distillation

Researchers propose converting Autoregressive Language Models (ARLMs) into Diffusion Language Models (DLMs) with on-policy distillation, an approach intended to mitigate distribution shifts and preserve pre-trained knowledge.

Bridging the Gap Between AR and Diffusion Architectures

The transition from Autoregressive Language Models (ARLMs) to Diffusion Language Models (DLMs) represents a significant architectural shift. While ARLMs rely on causal attention for next-token prediction, DLMs utilize bidirectional attention to generate text through a denoising process. Traditional methods of conversion typically involve replacing the causal attention masks of a pre-trained ARLM with bidirectional attention and subsequently retraining the model using a DLM objective.
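The difference between the two attention patterns can be made concrete with a small sketch. The helper names below (`causal_mask`, `bidirectional_mask`, `masked_softmax`) are hypothetical and chosen for illustration; they are not taken from the paper. The sketch only shows how swapping the mask changes which positions each token can attend to.

```python
import math

def causal_mask(seq_len):
    # Position i may attend only to positions j <= i (ARLM-style).
    return [[j <= i for j in range(seq_len)] for i in range(seq_len)]

def bidirectional_mask(seq_len):
    # Every position may attend to every other position (DLM-style).
    return [[True] * seq_len for _ in range(seq_len)]

def masked_softmax(scores, mask):
    # Row-wise softmax over attention scores; disallowed positions get weight 0.
    weights = []
    for score_row, mask_row in zip(scores, mask):
        peak = max(s for s, m in zip(score_row, mask_row) if m)
        exps = [math.exp(s - peak) if m else 0.0
                for s, m in zip(score_row, mask_row)]
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

# Toy attention scores for a 3-token sequence (all equal, for readability).
scores = [[0.0, 0.0, 0.0] for _ in range(3)]
causal_weights = masked_softmax(scores, causal_mask(3))
bidirectional_weights = masked_softmax(scores, bidirectional_mask(3))
```

With the causal mask, the first token attends only to itself; with the bidirectional mask, it spreads its attention evenly over all three positions. Conversion keeps the pre-trained weights and changes only this mask, which is why the model must then be retrained to use the newly visible right-hand context.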

Addressing Distribution Shifts in Model Transformation

The authors identify two critical distribution shifts that hinder the efficiency of standard conversion processes:

Objective Shift: Moving from a next-token prediction objective to a diffusion-based objective can lead to the loss of critical knowledge acquired by the ARLM during its initial pre-training phase.
Sampling Shift: Standard DLMs often suffer from a discrepancy between the training distribution and the distribution encountered during inference (sampling), which can degrade overall performance.

On-Policy Distillation for Data Efficiency

To overcome these challenges, the research explores the use of on-policy distillation. By leveraging the pre-existing knowledge of the ARLM, the proposed approach aims to create a more data-efficient pipeline for transforming these models into DLMs, reducing the amount of retraining required while maintaining the model's linguistic capabilities.
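Since the source does not describe the authors' mechanism, the following is only a generic sketch of what "on-policy" means in distillation: the student is scored by the teacher on sequences the student itself sampled, rather than on a fixed dataset. Everything here is an assumption made for illustration: the function names, the use of reverse KL as the divergence, and the token-by-token sampling loop (a DLM student would sample through its denoising procedure instead).

```python
import math
import random

def softmax(logits):
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def reverse_kl(student_logits, teacher_logits):
    # KL(student || teacher); the choice of divergence is an assumption.
    p_s = softmax(student_logits)
    p_t = softmax(teacher_logits)
    return sum(ps * (math.log(ps) - math.log(pt))
               for ps, pt in zip(p_s, p_t) if ps > 0.0)

def on_policy_distillation_loss(student_fn, teacher_fn, prompt, num_steps, rng):
    # student_fn / teacher_fn are hypothetical callables: tokens -> next-token logits.
    tokens = list(prompt)
    losses = []
    for _ in range(num_steps):
        s_logits = student_fn(tokens)
        t_logits = teacher_fn(tokens)
        # The teacher scores the state the student actually reached.
        losses.append(reverse_kl(s_logits, t_logits))
        # On-policy step: the next token comes from the student, not from data.
        probs = softmax(s_logits)
        next_token = rng.choices(range(len(probs)), weights=probs)[0]
        tokens.append(next_token)
    return sum(losses) / len(losses), tokens

# Toy usage with fixed-logit stand-ins for the two models.
rng = random.Random(0)
teacher = lambda toks: [2.0, 0.0, -1.0]
student = lambda toks: [0.0, 0.0, 0.0]
loss, sampled = on_policy_distillation_loss(student, teacher, [0], 4, rng)
```

Because the training states are drawn from the student's own sampling process, the student is corrected on the kind of inputs it will see at inference time, which is the usual motivation for on-policy training when a train/inference mismatch is the concern.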

Note: The provided source text is a partial abstract. Detailed experimental results and the specific implementation of the on-policy distillation mechanism are not available in the provided snippet.
