Training
OmniOPD: Logit-Free On-Policy Distillation via Speculative Verification
OmniOPD is a new framework for On-Policy Distillation (OPD) that removes the need for direct access to the teacher's token-level logits by using a logit-free, chunk-level supervision signal. It uses Monte Carlo rollouts to approximate the teacher's preferences from semantic similarity over multi-token chunks, which makes the training signal more robust to the noise and brittleness of logit matching. Compared with standard OPD, OmniOPD reports up to a +28.64% improvement on math tasks and an additional +9.54% gain when paired with advanced black-box teachers, giving practitioners a more reliable training signal for LLMs.
distillation, on-policy, supervised, reinforcement-learning
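The chunk-level, logit-free signal described in the summary can be illustrated with a minimal sketch. This is not the OmniOPD implementation. It only shows the general idea of scoring a student chunk by averaging its similarity to several sampled teacher continuations. Every name here (`jaccard_similarity`, `split_into_chunks`, `chunk_score`, `toy_teacher`) is hypothetical, the teacher is a toy stand-in for a black-box model that returns text only, and token-overlap (Jaccard) similarity is swapped in as a crude placeholder for whatever semantic similarity measure the method actually uses.

```python
import random
from typing import Callable, List


def jaccard_similarity(a: str, b: str) -> float:
    """Token-overlap similarity in [0, 1]; a placeholder for a semantic metric."""
    tokens_a, tokens_b = set(a.lower().split()), set(b.lower().split())
    if not tokens_a and not tokens_b:
        return 1.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def split_into_chunks(tokens: List[str], chunk_size: int) -> List[List[str]]:
    """Group a token sequence into consecutive multi-token chunks."""
    return [tokens[i:i + chunk_size] for i in range(0, len(tokens), chunk_size)]


def chunk_score(
    prefix: str,
    student_chunk: str,
    teacher_sample: Callable[[str, random.Random], str],
    num_rollouts: int = 8,
    seed: int = 0,
) -> float:
    """Monte Carlo estimate of how much the teacher 'prefers' a student chunk.

    Samples `num_rollouts` teacher continuations of `prefix` (text only, no
    logits) and averages the similarity between each one and the student chunk.
    """
    rng = random.Random(seed)
    rollouts = [teacher_sample(prefix, rng) for _ in range(num_rollouts)]
    return sum(jaccard_similarity(student_chunk, r) for r in rollouts) / num_rollouts


def toy_teacher(prefix: str, rng: random.Random) -> str:
    """Hypothetical black-box teacher: returns one sampled continuation."""
    return rng.choice(["so x equals 4", "thus x equals 4", "so x is 4"])


if __name__ == "__main__":
    prefix = "Solve 2x = 8."
    for chunk in ["so x equals 4", "the answer is banana"]:
        print(chunk, "->", round(chunk_score(prefix, chunk, toy_teacher), 3))
```

In this toy setup a chunk that agrees with the teacher's sampled continuations scores higher than an unrelated one, and that score could then serve as a chunk-level reward for the student. Increasing `num_rollouts` reduces the variance of the estimate at the cost of more teacher queries.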