This research evaluates the efficacy of on-policy self-distillation for continual post-training using self-distillation policy optimization (SDPO). While SDPO can accelerate in-domain specialization when teacher signals are stable and aligned, the study investigates the inherent limits of this approach in preserving existing capabilities while acquiring new knowledge.
huggingface/daily-papers