Agents
Reward Modeling for Multi-Agent Orchestration
The article introduces Orchestration Reward Modeling (OrchRM), a self-supervised framework for training orchestrators in Multi-Agent Systems (MAS) built on Large Language Models (LLMs). OrchRM trains a Bradley-Terry reward model on win-lose pairs derived from multi-agent executions, and reports improvements of up to 10x in training efficiency and 8% in test-time scaling accuracy across various domains, without relying on costly sub-agent rollouts. The approach gives practitioners a scalable way to build more efficient and robust orchestration in MAS.
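To make the Bradley-Terry objective concrete, here is a minimal sketch of the pairwise loss such a reward model would minimize on win-lose pairs. It is not OrchRM's implementation: the scalar scores stand in for reward model outputs, and the names `bt_loss`, `mean_bt_loss`, and `best_of_n` are hypothetical. The `best_of_n` helper shows one common way a reward model is used for test-time scaling (score several candidates, keep the highest), which is an assumption about usage rather than a detail taken from the article.

```python
import math
from typing import Callable, Sequence, Tuple, TypeVar

T = TypeVar("T")


def bt_loss(r_win: float, r_lose: float) -> float:
    """Bradley-Terry negative log-likelihood for one win-lose pair.

    Under Bradley-Terry, P(win beats lose) = sigmoid(r_win - r_lose),
    so the loss is -log(sigmoid(d)) = log(1 + exp(-d)) with d = r_win - r_lose.
    """
    d = r_win - r_lose
    # Branch on the sign of d so exp() never receives a large positive argument.
    if d >= 0:
        return math.log1p(math.exp(-d))
    return -d + math.log1p(math.exp(d))


def mean_bt_loss(pairs: Sequence[Tuple[float, float]]) -> float:
    """Average loss over (winner_score, loser_score) pairs."""
    return sum(bt_loss(w, l) for w, l in pairs) / len(pairs)


def best_of_n(candidates: Sequence[T], score: Callable[[T], float]) -> T:
    """Return the candidate with the highest score (hypothetical selection step)."""
    return max(candidates, key=score)
```

The loss is small when the winner is scored well above the loser and grows roughly linearly when the ordering is reversed, which is what pushes the model to rank winning orchestrations higher.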
reward-modeling, multi-agent, orchestration