HuggingFace Papers Jul 3, 2026

Denser ≠ Better: Limits of On-Policy Self-Distillation for Continual Post-Training

What happened

This paper investigates the limits of on-policy self-distillation during continual post-training. While the method accelerates in-domain specialization, it fails to prevent catastrophic forgetting and suffers from severe performance collapse when encountering out-of-distribution (OOD) scenarios, demonstrating that on-policy data alone is insufficient for robust continual learning.

Why it matters

It shows that purely synthetic, on-policy self-distillation loops have hard limits as a model customization strategy: faster in-domain gains do not buy retention of earlier capabilities or robustness on OOD inputs.

The take

This is a cautionary tale for teams relying solely on synthetic, model-generated data loops for continuous alignment or domain adaptation. Without anchoring the model with high-quality off-policy or general-domain data, you risk creating a highly specialized model that collapses on anything outside its narrow training distribution.

Do this

When post-training or fine-tuning models on specialized tasks, ensure your data mix includes diverse off-policy or general-domain datasets to mitigate catastrophic forgetting.
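
One way to put that advice into practice is to reserve a fixed share of every training batch for anchor data. The sketch below is illustrative only: the function name `mix_batches`, the `anchor_fraction` default of 0.3, and the batch size are assumptions chosen for the example, not values from the paper, and the right ratio will depend on your task.

```python
import random


def mix_batches(on_policy, anchor, anchor_fraction=0.3, batch_size=8, seed=0):
    """Yield batches that combine on-policy samples with anchor samples.

    on_policy: model-generated examples for the specialized task.
    anchor: off-policy or general-domain examples used to limit forgetting.
    anchor_fraction: share of each batch drawn from `anchor` (assumed value).
    """
    if not 0.0 <= anchor_fraction <= 1.0:
        raise ValueError("anchor_fraction must be between 0 and 1")

    rng = random.Random(seed)
    n_anchor = round(batch_size * anchor_fraction)
    n_on = batch_size - n_anchor
    if n_anchor > 0 and not anchor:
        raise ValueError("anchor data is required when anchor_fraction > 0")
    if n_on == 0:
        raise ValueError("anchor_fraction leaves no room for on-policy data")

    # Each on-policy example is used at most once per pass.
    on_pool = list(on_policy)
    rng.shuffle(on_pool)

    for start in range(0, len(on_pool) - n_on + 1, n_on):
        # Anchor examples are sampled with replacement so a small
        # anchor set can still cover every batch.
        batch = on_pool[start:start + n_on] + rng.choices(anchor, k=n_anchor)
        rng.shuffle(batch)
        yield batch


if __name__ == "__main__":
    on_policy = [("on_policy", i) for i in range(12)]
    anchor = [("anchor", i) for i in range(5)]
    for batch in mix_batches(on_policy, anchor, anchor_fraction=0.25):
        print([source for source, _ in batch])
```

Fixing the share per batch, rather than concatenating the datasets and shuffling, keeps the anchor signal present at every step even when the on-policy set is much larger.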
