This research evaluates the efficacy of on-policy self-distillation for continual post-training using self-distillation policy optimization (SDPO). While SDPO can accelerate in-domain specialization when teacher signals are stable and aligned, the study investigates the inherent limits of this approach in preserving existing capabilities while acquiring new knowledge.
