HuggingFace Papers
DOPD: Dual On-policy Distillation
reasoning
What happened
Dual On-policy Distillation (DOPD) addresses the 'privilege illusion' in LLM and VLM distillation, where student models struggle to learn from teacher policies due to mismatched capabilities. DOPD dynamically routes token-level supervision between the teacher and student based on advantage gaps and probabilities during on-policy training.
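To make the routing idea concrete, here is a toy sketch of token-level routing. It is not the paper's rule. The function names, the `gap_threshold` and `prob_floor` parameters, and the decision rule itself are all assumptions made for illustration: a token is supervised by the teacher only when the teacher's advantage exceeds the student's by a margin and the student already gives the teacher's token non-trivial probability.

```python
from typing import List, Sequence


def route_tokens(
    teacher_adv: Sequence[float],
    student_adv: Sequence[float],
    student_prob: Sequence[float],
    gap_threshold: float = 0.0,   # hypothetical margin on the advantage gap
    prob_floor: float = 0.05,     # hypothetical minimum student probability
) -> List[bool]:
    """Return one flag per token: True means 'use teacher supervision'.

    Assumed rule (illustrative only): route to the teacher when its
    advantage beats the student's by more than gap_threshold AND the
    student assigns at least prob_floor to the teacher's token.
    """
    if not (len(teacher_adv) == len(student_adv) == len(student_prob)):
        raise ValueError("all per-token sequences must have the same length")
    return [
        (t - s) > gap_threshold and p >= prob_floor
        for t, s, p in zip(teacher_adv, student_adv, student_prob)
    ]


def routed_loss(
    mask: Sequence[bool],
    teacher_term: Sequence[float],
    student_term: Sequence[float],
) -> float:
    """Average per-token loss, picking the teacher or student term per token."""
    if not mask:
        raise ValueError("need at least one token")
    per_token = [t if m else s for m, t, s in zip(mask, teacher_term, student_term)]
    return sum(per_token) / len(per_token)


# Three tokens: only the first passes both the gap and the probability check.
mask = route_tokens(
    teacher_adv=[1.0, 0.2, 0.9],
    student_adv=[0.5, 0.4, 0.1],
    student_prob=[0.3, 0.3, 0.01],
)
loss = routed_loss(mask, teacher_term=[2.0, 2.0, 2.0], student_term=[1.0, 1.0, 1.0])
```

In this sketch the third token has the largest advantage gap but is still routed away from the teacher, because the student gives it almost no probability. That is the kind of out-of-reach teacher output the summary describes.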
Why it matters
DOPD aims to improve how smaller models inherit complex reasoning capabilities from larger frontier models during distillation, by keeping the supervision a student receives within reach of what it can currently learn.
The take
Distillation is becoming the primary way smaller, local models inherit reasoning capabilities from frontier models. DOPD's dynamic routing approach is a clever way to prevent student models from getting confused by teacher outputs they aren't equipped to mimic yet. This is highly relevant if you are fine-tuning or distilling custom reasoning models.
Do this
Read the paper if you are actively training, fine-tuning, or distilling custom small language models or vision-language models.