TrainingarXiv cs.CL — 15 d ago

Trust Region On-Policy Distillation

The paper introduces Trust Region On-Policy Distillation (TrOPD), a novel technique designed to enhance the stability of On-Policy Distillation (OPD) for large language models (LLMs). TrOPD employs trust-region learning to limit optimization to areas where teacher supervision is reliable, and incorporates strategies like outlier estimation and off-policy guidance to improve performance. Experimental results indicate that TrOPD surpasses state-of-the-art OPD methods in various benchmarks, making it a significant advancement for practitioners focusing on efficient model training and deployment.

on-policy-distillationllmoptimizationrelevance 0.00 · engagement 0.00

Read at source ↗← all news