HuggingFace Papers
Denser ≠ Better: Limits of On-Policy Self-Distillation for Continual Post-Training
What happened
This paper investigates the limits of on-policy self-distillation during continual post-training. While the method accelerates in-domain specialization, it fails to prevent catastrophic forgetting and suffers from severe performance collapse when encountering out-of-distribution (OOD) scenarios, demonstrating that on-policy data alone is insufficient for robust continual learning.
Why it matters
It shows that purely synthetic, on-policy self-distillation loops used for model customization can speed up in-domain gains while still forgetting earlier capabilities and collapsing on OOD inputs.
The take
This is a cautionary tale for teams relying solely on synthetic, model-generated data loops for continuous alignment or domain adaptation. Without anchoring the model with high-quality off-policy or general-domain data, you risk creating a highly specialized model that collapses on anything outside its narrow training distribution.
Do this
When post-training or fine-tuning models on specialized tasks, ensure your data mix includes diverse off-policy or general-domain datasets to mitigate catastrophic forgetting.
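One way to act on this is to fix the share of general-domain examples in each training set before you start. The sketch below is a minimal illustration, not the paper's method: the function name `build_training_mix`, the `general_fraction` parameter, and the 30% default are all assumptions chosen for the example, and the right ratio for your task has to be found empirically.

```python
import random


def build_training_mix(on_policy, general_domain, general_fraction=0.3, seed=0):
    """Blend on-policy samples with general-domain samples.

    Hypothetical helper: keeps every on-policy example and adds enough
    general-domain examples so they make up about `general_fraction`
    of the final mix (capped by how many are available).
    """
    if not 0.0 <= general_fraction < 1.0:
        raise ValueError("general_fraction must be in [0, 1)")

    rng = random.Random(seed)

    # Solve n_general / (n_on_policy + n_general) = general_fraction.
    n_general = round(len(on_policy) * general_fraction / (1.0 - general_fraction))
    n_general = min(n_general, len(general_domain))

    # Sample without replacement so the anchor data stays diverse.
    anchors = rng.sample(general_domain, n_general)

    # Tag each example with its source so the mix can be audited later.
    mix = [("on_policy", ex) for ex in on_policy]
    mix += [("general", ex) for ex in anchors]
    rng.shuffle(mix)
    return mix


if __name__ == "__main__":
    on_policy = [f"model-generated sample {i}" for i in range(70)]
    general = [f"general-domain sample {i}" for i in range(100)]
    mix = build_training_mix(on_policy, general, general_fraction=0.3)
    n_general = sum(1 for source, _ in mix if source == "general")
    print(f"{len(mix)} examples, {n_general} general-domain")
```

Tagging each example with its source makes it easy to check the realized ratio. Pair a mix like this with a held-out general-domain evaluation set so forgetting shows up as a measured drop rather than a surprise.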