MENTOR: Reinforcement Learning via Flexible Teacher-Optimized Rewards for Tool-Use Distillation
arXiv:2510.18383v3. Abstract: Distilling the tool-use capabilities of large language models (LLMs) into small language models (SLMs) is essential for their practical application. The predominant approach, supervised fine-tuning (SFT), suffers from poor out-of-domain...
The Teacher Becomes the Algorithm: MENTOR’s New Paradigm for Tool-Use Distillation
The challenge of compressing the reasoning and tool-use capabilities of large language models (LLMs) into smaller, more deployable models has long been a bottleneck for practical AI. Supervised fine-tuning (SFT), the standard method, often fails when the student model encounters scenarios outside its narrow training distribution. A new paper, MENTOR, proposes a fundamentally different approach: instead of having a static teacher model simply provide demonstrations, it uses reinforcement learning (RL) to dynamically optimize the teacher’s reward function specifically for the student’s learning process.
What Happened
The MENTOR framework reframes knowledge distillation as a two-level optimization problem. At the inner level, a small student model is trained via RL to maximize a reward signal for correct tool use. At the outer level, a larger teacher model (or a learned reward model) is itself optimized to produce rewards that maximize the student’s final performance. The teacher is not showing the student the “right answer” (as in SFT); it is learning to be a better coach, adjusting its feedback based on what the student struggles with. The paper reports that this adaptive, teacher-optimized reward improves generalization, especially on out-of-domain tool-use tasks where SFT-trained students typically degrade.
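The two-level structure described above can be illustrated with a toy example. The following is a minimal sketch under strong simplifying assumptions, not the paper's implementation: the "student" is a softmax policy over four hypothetical tools, the "teacher" is a vector of per-tool rewards `phi`, and the outer gradient is estimated by finite differences rather than by whatever method MENTOR actually uses. All function and parameter names are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def train_student(phi, steps=50, lr=1.0):
    """Inner level (toy): the student maximizes the expected teacher reward.

    phi[a] is the teacher's reward for choosing tool a. The update uses the
    exact policy gradient of sum_a pi(a) * phi[a] for a softmax policy:
    d/d theta_a = pi(a) * (phi[a] - baseline).
    """
    theta = [0.0] * len(phi)
    for _ in range(steps):
        pi = softmax(theta)
        baseline = sum(p * r for p, r in zip(pi, phi))
        theta = [t + lr * p * (r - baseline) for t, p, r in zip(theta, pi, phi)]
    return softmax(theta)

def student_task_success(phi, correct_tool):
    """Outer objective (toy): probability the trained student picks the right tool."""
    return train_student(phi)[correct_tool]

def optimize_teacher(num_tools=4, correct_tool=2, outer_steps=10,
                     outer_lr=2.0, eps=1e-3):
    """Outer level (toy): adjust the teacher's rewards so that a student
    trained on them succeeds at the true task.

    Assumption: central finite differences stand in for the outer gradient.
    """
    phi = [0.0] * num_tools
    for _ in range(outer_steps):
        grad = []
        for i in range(num_tools):
            up = list(phi)
            up[i] += eps
            down = list(phi)
            down[i] -= eps
            diff = (student_task_success(up, correct_tool)
                    - student_task_success(down, correct_tool))
            grad.append(diff / (2 * eps))
        phi = [p + outer_lr * g for p, g in zip(phi, grad)]
    return phi
```

Starting from an all-zero reward vector, the student stays at chance level (0.25 with four tools). After `optimize_teacher()` runs, the learned `phi` gives the correct tool the highest reward and `student_task_success(phi, 2)` rises well above chance. The point of the sketch is only the nesting: every outer update retrains the student, which is also why the approach costs more than a single SFT pass.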
Why It Matters
This work addresses a core weakness of distillation: the assumption that a good teacher’s demonstrations are inherently good for every student. In reality, a student model’s limited capacity means it cannot perfectly mimic a teacher. SFT forces the student to learn a static mapping, which tends to break when inputs drift away from the training distribution. MENTOR’s key insight is that the reward signal itself should be a learnable, student-aware function. Three implications follow:
- Beyond Imitation: The field moves from “copy the expert” to “learn from a teacher who adapts to you.” This mirrors how human tutoring works: a good tutor doesn’t just give answers but adjusts their feedback based on the learner’s mistakes.
- Robustness to Distribution Shift: By optimizing the teacher’s rewards for the student’s performance on challenging, unseen tasks, MENTOR directly tackles the out-of-domain generalization problem that plagues SFT.
- Efficiency of the Optimization Loop: While computationally more expensive than a single SFT pass, MENTOR’s bilevel optimization can pay off over time, because the teacher’s learned reward function becomes a reusable asset for training multiple student models or for continual learning.
Implications for AI Practitioners
For teams deploying small models for tool use (e.g., code execution, API calls, database queries), this paper signals a possible shift in best practices. The immediate takeaway is that relying on static distillation data alone can be a liability. Practitioners should consider:
- Investing in reward model infrastructure: The teacher model in MENTOR effectively becomes a learned reward function. Building or fine-tuning a model to provide adaptive feedback for a specific student architecture may yield better results than collecting massive datasets of perfect demonstrations.
- Rethinking evaluation: Success on a held-out test set drawn from the same distribution as the training data is insufficient. MENTOR’s reported gains show up in out-of-domain scenarios. Teams should stress-test their distilled models on tasks that deliberately differ from the training distribution.
- Computational trade-offs: MENTOR requires running RL for the student and optimizing the teacher’s reward parameters. This is not a lightweight alternative to SFT. It is a strategic investment for applications where reliability and generalization are critical, such as autonomous agents or production pipelines.
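The evaluation advice above can be sketched as a small harness that reports accuracy separately for in-domain and out-of-domain tasks. This is an illustrative sketch, not a benchmark from the paper: the record format, the split names, the example tool calls, and the exact-match scoring rule are all assumptions.

```python
from collections import defaultdict

def accuracy_by_split(records):
    """Compute exact-match accuracy per evaluation split.

    records: iterable of (split_name, predicted_call, expected_call) tuples.
    Exact string match is an assumed scoring rule; real tool-use benchmarks
    often compare parsed arguments or execution results instead.
    """
    correct = defaultdict(int)
    total = defaultdict(int)
    for split, predicted, expected in records:
        total[split] += 1
        correct[split] += int(predicted == expected)
    return {split: correct[split] / total[split] for split in total}

def generalization_gap(acc, in_domain="in_domain", out_of_domain="out_of_domain"):
    """In-domain accuracy minus out-of-domain accuracy (larger means worse transfer)."""
    return acc[in_domain] - acc[out_of_domain]

# Hypothetical predictions from a distilled student model.
example_records = [
    ("in_domain", "get_weather(city='Paris')", "get_weather(city='Paris')"),
    ("in_domain", "run_sql('SELECT 1')", "run_sql('SELECT 1')"),
    ("in_domain", "search('llm')", "search('llm')"),
    ("in_domain", "get_weather(city='Rome')", "get_weather(city='Oslo')"),
    ("out_of_domain", "convert(unit='km')", "convert(unit='km')"),
    ("out_of_domain", "search('flights')", "book_flight(dest='NYC')"),
]

example_accuracy = accuracy_by_split(example_records)
example_gap = generalization_gap(example_accuracy)
```

Tracking the gap alongside the raw accuracies makes it visible when a student that looks strong on familiar tasks loses ground on unfamiliar ones, which is the failure mode the article attributes to SFT-based distillation.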
Key Takeaways
- MENTOR replaces static supervised fine-tuning with a bilevel RL framework where a teacher model learns to generate student-specific rewards, not just demonstrations.
- The primary reported benefit is improved out-of-domain generalization for tool-use tasks, addressing a critical failure mode of standard distillation.
- For practitioners, this implies a need to move from static dataset collection to adaptive reward infrastructure, particularly for high-stakes agentic applications.
- The approach is computationally heavier than SFT, but the learned reward function is reusable and the resulting students are more robust, making it suitable for scenarios where model reliability in novel situations is paramount.