HuggingFace Papers Jul 1, 2026

DOPD: Dual On-policy Distillation

reasoning

What happened

Dual On-policy Distillation (DOPD) addresses the 'privilege illusion' in LLM and VLM distillation: a student model struggles to learn from a teacher policy when the two are mismatched in capability. During on-policy training, DOPD routes supervision token by token between the teacher and the student, using advantage gaps and probabilities to decide which signal each token receives.
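
To make the routing idea concrete, here is a minimal sketch of what token-level routing could look like. It is not the paper's algorithm: the decision rule, the `route_tokens` name, and the `gap_threshold` and `min_student_prob` parameters are all assumptions made for illustration. Each token is sent to the teacher signal only if the teacher looks better there (positive advantage gap) and the student already assigns the token enough probability to plausibly imitate it. Every other token falls back to the student's own signal.

```python
import math
from typing import List


def route_tokens(
    student_logprobs: List[float],
    advantage_gaps: List[float],
    gap_threshold: float = 0.0,      # hypothetical: minimum teacher-minus-student advantage
    min_student_prob: float = 0.05,  # hypothetical: minimum student probability to count as "reachable"
) -> List[str]:
    """Return "teacher" or "student" for each token of a student-sampled sequence.

    Illustrative rule only (an assumption, not taken from the DOPD paper):
    use teacher supervision when the teacher has the advantage AND the
    student gives the token enough probability mass to learn from it.
    """
    if len(student_logprobs) != len(advantage_gaps):
        raise ValueError("student_logprobs and advantage_gaps must have the same length")

    routes = []
    for logprob, gap in zip(student_logprobs, advantage_gaps):
        teacher_helps = gap > gap_threshold
        reachable = math.exp(logprob) >= min_student_prob
        routes.append("teacher" if teacher_helps and reachable else "student")
    return routes


# Three tokens: confident with a large gap, near-zero probability with a
# large gap, and a token where the student is already ahead.
example_routes = route_tokens(
    student_logprobs=[math.log(0.6), math.log(0.01), math.log(0.3)],
    advantage_gaps=[0.8, 1.2, -0.1],
)
print(example_routes)
```

In a real training loop the routes would select which loss term applies to each token. The paper defines its own criteria and losses, so treat the thresholds here as placeholders.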

Why it matters

DOPD aims to improve how smaller models inherit complex reasoning capabilities from larger frontier models during distillation.

The take

Distillation is becoming the primary way smaller, local models inherit reasoning capabilities from frontier models. DOPD's dynamic routing approach is a clever way to prevent student models from getting confused by teacher outputs they aren't equipped to mimic yet. This is highly relevant if you are fine-tuning or distilling custom reasoning models.

Do this

Read the paper if you are actively training, fine-tuning, or distilling custom small language models or vision-language models.

