Compress-Distill: Optimizing Knowledge Distillation via Reasoning Trace Compression

Researchers introduce "Compress-Distill," a method to mitigate the computational costs and verbosity associated with distilling long chain-of-thought (CoT) reasoning traces from large-scale teacher models into smaller student models.

The Challenge of Verbose Reasoning Traces

Modern reasoning models generate extensive chain-of-thought traces to arrive at correct answers. While these traces are invaluable for knowledge distillation, their length presents two primary challenges: high computational overhead during the training of student models and a tendency for student models to mimic this verbosity, leading to inefficient and overly wordy outputs.

The Compress-Distill Methodology

The proposed approach focuses on the post-hoc compression of reasoning traces before they are used for distillation. By reducing the length of the teacher's reasoning paths while preserving the logical integrity of the solution, the researchers aim to optimize the efficiency of the distillation process.
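As a rough illustration of the idea, the post-hoc step can be pictured as a small data-preparation pass that runs between trace generation and student training. The sketch below is not the authors' implementation: the `Trace` record, the `build_distillation_set` helper, the prompt/target layout, and the toy `keep_first_sentence` compressor are all hypothetical names chosen for the example. In the study, compression is performed by an instruction-tuned model, which is represented here only by a callable.

```python
from dataclasses import dataclass
from typing import Callable, Dict, Iterable, List


@dataclass
class Trace:
    """One teacher sample (hypothetical record layout)."""
    question: str
    reasoning: str
    answer: str
    is_correct: bool


def build_distillation_set(
    traces: Iterable[Trace],
    compress: Callable[[str], str],
) -> List[Dict[str, str]]:
    """Keep correct traces, compress their reasoning, emit training pairs."""
    examples = []
    for trace in traces:
        if not trace.is_correct:
            continue  # only correct traces are used for distillation
        short_reasoning = compress(trace.reasoning)
        examples.append(
            {
                "prompt": trace.question,
                # Assumed target layout: compressed reasoning, then the answer.
                "target": f"{short_reasoning}\n\nAnswer: {trace.answer}",
            }
        )
    return examples


def keep_first_sentence(reasoning: str) -> str:
    """Toy stand-in for the compressor model (assumption, not the real method)."""
    return reasoning.split(". ")[0].rstrip(".") + "."
```

Because the compressor is passed in as a plain function, the same pass could be rerun with a different compressor without touching the trace-generation or training code.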

Experimental Setup and Data Generation

The study used two teacher models to generate a dataset of correct reasoning traces:

  • Qwen3.5-397B-A17B: Generated approximately 283k correct traces.
  • gpt-oss-120B: Generated approximately 283k correct traces.

Following generation, two instruction-tuned models were used to compress these traces. The compressed traces retained only 8.6% to 21.0% of their original character length.
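The retention figure is a ratio of character counts, which can be sketched as follows. The function names are hypothetical. The text does not say whether the reported range comes from per-trace ratios or from corpus-level totals, so both variants are shown as assumptions.

```python
from typing import Iterable, Tuple


def char_retention(original: str, compressed: str) -> float:
    """Fraction of the original character length kept by one compressed trace."""
    if not original:
        raise ValueError("original trace must be non-empty")
    return len(compressed) / len(original)


def corpus_retention(pairs: Iterable[Tuple[str, str]]) -> float:
    """Total compressed characters divided by total original characters.

    Assumption: an aggregate over the whole corpus; the study may instead
    report a mean of per-trace ratios.
    """
    total_original = 0
    total_compressed = 0
    for original, compressed in pairs:
        total_original += len(original)
        total_compressed += len(compressed)
    if total_original == 0:
        raise ValueError("corpus must contain at least one original character")
    return total_compressed / total_original
```

Under this definition, a 1,000-character trace compressed to 86 characters has a retention of 0.086, the low end of the reported range.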

Performance and Ablation Studies

The method was evaluated on a 48-run main grid. The researchers also ran seven ablation studies that truncate traces from the Qwen teacher, to measure how length reduction alone affects the student's learning performance.
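A truncation baseline of this kind can be sketched in a few lines. The details below are assumptions made for illustration: the text does not specify whether truncation is measured in characters or tokens, nor which part of the trace is kept. This hypothetical `truncate_trace` helper keeps the start of the trace up to a character budget and avoids cutting a word in half.

```python
def truncate_trace(reasoning: str, max_chars: int) -> str:
    """Keep at most max_chars leading characters, ending on a word boundary.

    Assumption: head-only, character-based truncation; the ablations in
    the study may use a different unit or keep a different part of the trace.
    """
    if max_chars < 0:
        raise ValueError("max_chars must be non-negative")
    if len(reasoning) <= max_chars:
        return reasoning  # already within budget
    cut = reasoning[:max_chars]
    # If the cut lands in the middle of a word, back off to the last space.
    if not reasoning[max_chars].isspace():
        last_space = cut.rfind(" ")
        if last_space > 0:
            cut = cut[:last_space]
    return cut.rstrip()
```

Unlike model-based compression, this discards everything past the budget regardless of content, which makes it a natural point of comparison for separating the effect of shorter traces from the effect of how they are shortened.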

Initial findings indicate that compressed traces substantially reduce the number of training tokens required, lowering the compute cost of distillation while preserving the core reasoning capabilities transferred to the student model.
