Causal-rCM: Advancing Autoregressive Diffusion Distillation for Streaming Video and World Models

Researchers introduce Causal-rCM, a unified distillation framework that integrates teacher-forcing and self-forcing mechanisms to optimize autoregressive diffusion transformers for real-time streaming video generation and interactive world models.

Bridging the Gap in Autoregressive Video Diffusion

Causal diffusion transformers have become a strong paradigm for generating streaming video and building action-conditioned interactive world models. Real-time use, however, requires distilling them into samplers that preserve fidelity while cutting sampling cost. Causal-rCM addresses this by extending rCM, a consistency-model-based distillation framework, to autoregressive architectures.

The Core Architecture: Complementary Divergences

Causal-rCM builds on two distinct distillation methods: Consistency Models (CMs) and Distribution Matching Distillation (DMD). The framework exploits the complementarity between forward and reverse divergences to stabilize distillation.
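
To make the forward-versus-reverse distinction concrete, here is a minimal numerical sketch. It assumes the KL divergence as the example divergence (the article does not say which divergences Causal-rCM uses), and the teacher and student distributions are hypothetical toy values, not anything from the paper.

```python
import math

def kl(p, q):
    """KL(p || q) for two discrete distributions given as probability lists."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)

# Hypothetical bimodal "teacher" distribution over four outcomes.
teacher = [0.49, 0.01, 0.01, 0.49]

# Two hypothetical "student" candidates (illustrative assumptions).
students = {
    "mode_seeking": [0.97, 0.01, 0.01, 0.01],   # commits to a single mode
    "mode_covering": [0.25, 0.25, 0.25, 0.25],  # spreads mass over everything
}

# Forward KL: expectation under the teacher, KL(teacher || student).
forward = {name: kl(teacher, s) for name, s in students.items()}
# Reverse KL: expectation under the student, KL(student || teacher).
reverse = {name: kl(s, teacher) for name, s in students.items()}

for name in students:
    print(f"{name}: forward={forward[name]:.3f} reverse={reverse[name]:.3f}")
```

In this toy setting the forward divergence favors the student that covers both modes, while the reverse divergence favors the student that commits to one. That opposite preference is one common reading of why the two kinds of objective can complement each other.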

Teacher-Forcing vs. Self-Forcing

A key contribution of this work is an "open recipe" for combining teacher-forcing and self-forcing. Teacher-forcing conditions each prediction on ground-truth past frames, whereas self-forcing conditions it on the model's own previously generated frames, as happens at inference time. By unifying the two, Causal-rCM mitigates the drift typical of autoregressive generation, so the distilled model stays stable over long sequences of streaming video frames while remaining efficient.
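
The difference between the two conditioning schemes, and why self-generated context can drift, can be sketched with a toy example. This is an illustration only, not the Causal-rCM recipe: `toy_student`, its 0.1 bias, and the scalar "frames" are all hypothetical stand-ins.

```python
from typing import Callable, List

Frame = float  # stand-in for a video frame; a real frame would be a tensor

def toy_student(context: List[Frame]) -> Frame:
    """Hypothetical generator: true dynamics are +1.0 per frame, 0.1 is model error."""
    return context[-1] + 1.0 + 0.1

def teacher_forcing_rollout(model: Callable[[List[Frame]], Frame],
                            real_frames: List[Frame]) -> List[Frame]:
    """Each prediction is conditioned on the ground-truth history."""
    return [model(real_frames[:t]) for t in range(1, len(real_frames))]

def self_forcing_rollout(model: Callable[[List[Frame]], Frame],
                         first_frame: Frame, num_frames: int) -> List[Frame]:
    """Each prediction is conditioned on the model's own earlier outputs."""
    frames = [first_frame]
    for _ in range(num_frames - 1):
        frames.append(model(frames))
    return frames[1:]

real = [0.0, 1.0, 2.0, 3.0, 4.0]
tf = teacher_forcing_rollout(toy_student, real)
sf = self_forcing_rollout(toy_student, real[0], len(real))

# Per-frame absolute error against the ground-truth frames.
tf_err = [abs(p - r) for p, r in zip(tf, real[1:])]
sf_err = [abs(p - r) for p, r in zip(sf, real[1:])]
print("teacher-forcing errors:", tf_err)
print("self-forcing errors:   ", sf_err)
```

With ground-truth context the per-frame error stays constant, while with self-generated context the same small error compounds frame after frame. Training only on the first regime leaves the model unprepared for the second, which is the mismatch that combining the two is meant to address.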

Applications in World Models and Streaming

By applying this distillation recipe to causal diffusion transformers, the researchers aim to improve interactive world models. These models must predict future states from current observations and specific actions in real time, a task that demands both high temporal consistency and low latency, the two properties the Causal-rCM approach targets.
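
The interaction pattern described above can be sketched as a simple streaming loop. Everything here is a hypothetical placeholder: `toy_world_model` stands in for a distilled generator, states are 2D positions rather than frames, and the bounded context window is an assumption made for the sketch, not a detail from the source.

```python
from collections import deque
from typing import Deque, Dict, Iterable, List, Tuple

State = Tuple[int, int]  # toy 2D position standing in for a video frame

# Hypothetical action set mapping each action to a displacement.
ACTIONS: Dict[str, Tuple[int, int]] = {
    "left": (-1, 0), "right": (1, 0), "up": (0, 1), "down": (0, -1),
}

def toy_world_model(context: Deque[State], action: str) -> State:
    """Hypothetical stand-in: predicts the next state from the latest one and an action."""
    x, y = context[-1]
    dx, dy = ACTIONS[action]
    return (x + dx, y + dy)

def stream(initial: State, actions: Iterable[str], window: int = 3) -> List[State]:
    """Generate one state per incoming action, keeping a bounded causal context."""
    context: Deque[State] = deque([initial], maxlen=window)
    outputs: List[State] = []
    for action in actions:
        nxt = toy_world_model(context, action)
        context.append(nxt)  # the model's own output becomes future context
        outputs.append(nxt)  # emitted immediately, before later actions arrive
    return outputs

frames = stream((0, 0), ["right", "right", "up", "left"])
print(frames)
```

Each state is emitted as soon as its action arrives, so the cost of one model call bounds the response latency. This is why reducing sampling cost through distillation matters for interactive use.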
