Training · arXiv cs.AI · 15 d ago

MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation

MENTOR is a reinforcement learning framework for distilling tool-use capabilities from large language models (LLMs) into smaller language models (SLMs). Its flexible reward structure balances behavioral alignment against downstream task performance, addressing limitations of both supervised fine-tuning (SFT) and traditional reinforcement learning approaches. Experiments on controlled executable-tool benchmarks indicate that MENTOR significantly improves out-of-domain tool-use performance, suggesting it could help build more adaptable SLMs for practical applications.
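
The summary does not give MENTOR's actual reward formula, so the following is only an illustrative sketch of what "balancing behavioral alignment with downstream performance" could look like. Every name here (`ToolCall`, `alignment_score`, `blended_reward`, `alignment_weight`) is hypothetical, and the position-wise matching and linear blend are assumptions rather than the paper's method.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class ToolCall:
    """One tool invocation in a trajectory (hypothetical structure)."""
    name: str
    arguments: tuple  # hashable form, e.g. tuple(sorted(kwargs.items()))


def alignment_score(student: Sequence[ToolCall], teacher: Sequence[ToolCall]) -> float:
    """Share of positions where the student's call equals the teacher's,
    normalized by the longer of the two trajectories."""
    if not student and not teacher:
        return 1.0
    matches = sum(1 for s, t in zip(student, teacher) if s == t)
    return matches / max(len(student), len(teacher))


def blended_reward(
    student: Sequence[ToolCall],
    teacher: Sequence[ToolCall],
    task_solved: bool,
    alignment_weight: float = 0.5,  # assumed knob, not taken from the paper
) -> float:
    """Linear blend of behavioral alignment and task outcome (an assumption)."""
    if not 0.0 <= alignment_weight <= 1.0:
        raise ValueError("alignment_weight must be in [0, 1]")
    outcome = 1.0 if task_solved else 0.0
    return (
        alignment_weight * alignment_score(student, teacher)
        + (1.0 - alignment_weight) * outcome
    )
```

In this sketch, setting `alignment_weight` to 1.0 rewards pure imitation of the teacher's tool calls, while 0.0 rewards only whether the task was solved; intermediate values trade the two off.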

reinforcement_learning · tool_use · distillation
