Training
MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation
The article introduces MENTOR, a reinforcement learning framework for distilling tool-use capabilities from large language models (LLMs) into small language models (SLMs). Its flexible, teacher-optimized reward balances behavioral alignment against downstream task performance, addressing limitations of both supervised fine-tuning (SFT) and conventional reinforcement learning. In experiments on controlled executable-tool benchmarks, MENTOR significantly improves out-of-domain tool-use performance, suggesting it could help build more adaptable SLMs for practical applications.
reinforcement_learning tool_use distillation
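
The summary does not state MENTOR's actual reward formula. As a purely illustrative sketch of what "balancing behavioral alignment with downstream performance" could look like, the Python below assumes a convex combination of two terms: a hypothetical alignment score (how closely the student's tool-call sequence matches the teacher's) and a binary task-success signal. The names `alignment_score`, `blended_reward`, and the weight `alpha` are assumptions made for this example, not taken from the article.

```python
from typing import Sequence


def alignment_score(student_calls: Sequence[str], teacher_calls: Sequence[str]) -> float:
    """Hypothetical alignment term: number of position-wise matching tool
    calls, divided by the length of the longer sequence."""
    if not student_calls and not teacher_calls:
        return 1.0  # both chose to call no tools: treated as full agreement
    matches = sum(s == t for s, t in zip(student_calls, teacher_calls))
    return matches / max(len(student_calls), len(teacher_calls))


def blended_reward(
    student_calls: Sequence[str],
    teacher_calls: Sequence[str],
    task_success: bool,
    alpha: float = 0.5,
) -> float:
    """Assumed reward shape: alpha weights alignment, (1 - alpha) weights
    downstream task success. Not the article's confirmed formulation."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    align = alignment_score(student_calls, teacher_calls)
    return alpha * align + (1.0 - alpha) * float(task_success)


# Usage: the student matches one of two teacher calls and solves the task.
reward = blended_reward(["search", "lookup"], ["search", "calc"], task_success=True)
```

Under this assumed form, `alpha = 1` rewards pure imitation of the teacher (closer in spirit to SFT), while `alpha = 0` rewards only task outcome. Intermediate values trade the two off.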