Taylor-Calibrate: Principled Initialization for Hybrid Linear Attention Distillation

Researchers introduce Taylor-Calibrate, a principled initialization method designed to stabilize and improve the distillation of pretrained Transformers into hybrid linear attention models, specifically targeting Gated DeltaNet (GDN) architectures to reduce inference costs without sacrificing model quality.

Overcoming the Quadratic Bottleneck in Long-Context Inference

Hybrid linear attention models have emerged as a promising alternative to standard Transformer architectures. By addressing the quadratic computational complexity and the substantial memory overhead associated with the KV-cache in full softmax attention, these models enable faster and more efficient long-context inference. However, the primary challenge remains the high cost of pretraining these architectures from scratch.
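To make the memory argument concrete, here is a minimal back-of-the-envelope sketch. The function names and the configuration numbers are illustrative assumptions, not figures from the source. It compares a KV-cache that grows with sequence length against a fixed-size recurrent state of the kind linear attention layers keep.

```python
def kv_cache_bytes(seq_len, n_layers, n_kv_heads, head_dim, bytes_per_elem=2):
    """Size of a softmax-attention KV-cache for one sequence.

    Each layer stores one key tensor and one value tensor, each of shape
    (seq_len, n_kv_heads, head_dim), so the total grows linearly with seq_len.
    """
    return 2 * n_layers * seq_len * n_kv_heads * head_dim * bytes_per_elem


def linear_state_bytes(n_layers, n_heads, key_dim, value_dim, bytes_per_elem=2):
    """Size of a linear-attention recurrent state for one sequence.

    Each head keeps a single (key_dim x value_dim) matrix, so the total
    does not depend on how many tokens have been processed.
    """
    return n_layers * n_heads * key_dim * value_dim * bytes_per_elem


# Hypothetical configuration, chosen only for illustration.
cfg = dict(n_layers=32, heads=8, dim=128)

for seq_len in (4_096, 32_768, 262_144):
    cache = kv_cache_bytes(seq_len, cfg["n_layers"], cfg["heads"], cfg["dim"])
    state = linear_state_bytes(cfg["n_layers"], cfg["heads"], cfg["dim"], cfg["dim"])
    print(f"{seq_len:>7} tokens: KV-cache {cache / 2**20:8.1f} MiB, "
          f"linear state {state / 2**20:6.1f} MiB")
```

A hybrid model retains some softmax attention layers, so in practice its footprint sits between these two extremes, depending on how many layers are converted.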

The Challenge of Model Conversion

To bypass the need for expensive pretraining, researchers often attempt to convert existing pretrained Transformers into linear attention students. This distillation process is appealing in theory but has historically proven brittle. A common naive approach is to copy the teacher's attention projections directly into a Gated DeltaNet (GDN) student. This does not adequately specify all of the parameters the student needs, which leads to suboptimal performance and instability during the transition.
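The sketch below illustrates why copying projections alone can leave a student underspecified. It assumes the gated delta rule as it is commonly written for Gated DeltaNet, S_t = alpha_t * S_{t-1} (I - beta_t k_t k_t^T) + beta_t v_t k_t^T with output o_t = S_t q_t. The function names are hypothetical, and this is a toy single-head recurrence rather than the architecture or code from the source. Softmax attention has no counterpart for the decay gate `alpha` or the write strength `beta`, so identical queries, keys, and values can still produce different outputs.

```python
def gated_delta_step(S, q, k, v, alpha, beta):
    """One step of a gated delta rule on a (d_v x d_k) state S (list of rows).

    alpha in [0, 1] decays the old memory; beta in [0, 1] controls how
    strongly the value stored under key k is overwritten by v.
    """
    d_v, d_k = len(S), len(k)
    # What the current memory returns for key k: S k.
    pred = [sum(S[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    # alpha * (S - beta * (S k) k^T) + beta * v k^T
    S_new = [
        [alpha * (S[i][j] - beta * pred[i] * k[j]) + beta * v[i] * k[j]
         for j in range(d_k)]
        for i in range(d_v)
    ]
    # Read out with the query: o = S_new q.
    out = [sum(S_new[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]
    return S_new, out


def run(qs, ks, vs, alphas, betas):
    """Run the recurrence over a short sequence, starting from a zero state."""
    S = [[0.0] * len(ks[0]) for _ in range(len(vs[0]))]
    outs = []
    for q, k, v, a, b in zip(qs, ks, vs, alphas, betas):
        S, o = gated_delta_step(S, q, k, v, a, b)
        outs.append(o)
    return outs


# Identical q/k/v (as if copied from a teacher), two different gate settings.
qs = [[1.0, 0.0], [1.0, 0.0]]
ks = [[1.0, 0.0], [0.0, 1.0]]
vs = [[2.0, 3.0], [5.0, 7.0]]

outs_a = run(qs, ks, vs, alphas=[1.0, 1.0], betas=[1.0, 1.0])
outs_b = run(qs, ks, vs, alphas=[1.0, 0.5], betas=[1.0, 1.0])
print(outs_a[1], outs_b[1])
```

With no decay, the second query still retrieves the first value in full; with `alpha = 0.5` at the second step, the same retrieval is halved. The gates therefore need an initialization of their own, which weight copying does not supply.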

Introducing Taylor-Calibrate

The proposed "Taylor-Calibrate" method provides a principled framework for initializing the student before distillation. Rather than relying on direct weight copying, it uses a calibration strategy (likely based on Taylor expansions, as the title suggests) to better align the student's linear attention mechanism with the teacher's softmax attention behavior. The aim is a more stable transition that preserves the quality of the original pretrained model while gaining the efficiency of the linear architecture.
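Since the source gives no details of the algorithm, the following is not Taylor-Calibrate itself. It is only a generic illustration, under the assumption that a Taylor expansion of the softmax kernel is involved, of how a first-order expansion exp(s) ≈ 1 + s relates softmax attention weights to a form that is linear in the score. The function names are hypothetical.

```python
import math


def softmax_weights(scores):
    """Standard softmax attention weights for one query."""
    m = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def taylor1_weights(scores):
    """Weights from the first-order expansion exp(s) ~ 1 + s.

    Only meaningful when every score is small in magnitude
    (and in particular when 1 + s stays positive).
    """
    approx = [1.0 + s for s in scores]
    total = sum(approx)
    return [a / total for a in approx]


def max_abs_diff(a, b):
    return max(abs(x - y) for x, y in zip(a, b))


small = [0.01, -0.02, 0.03]   # scores near zero: approximation is tight
large = [3.0, 0.0]            # peaked scores: approximation breaks down

print(max_abs_diff(softmax_weights(small), taylor1_weights(small)))
print(max_abs_diff(softmax_weights(large), taylor1_weights(large)))
```

The gap between the two regimes suggests why any expansion-based alignment would have to be calibrated to the score distribution the teacher actually produces, though how the method handles this is not stated in the summary.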

Note: Due to the limited description provided, specific mathematical details of the Taylor-Calibrate algorithm and the exact quantitative results are not available in this summary.
