Agents
Counterfactual Credit Policy Optimization for Multi-Agent Collaboration
The paper introduces two credit assignment methods for multi-agent collaboration among large language models: Counterfactual Credit for Policy Optimization (CCPO) and Self-Evaluated Credit for Policy Optimization (SEPO). CCPO estimates each agent's contribution by contrasting the joint outcome with counterfactual scenarios, while SEPO uses self- and peer-evaluations to generate agent-specific rewards. Evaluations on mathematical reasoning benchmarks, including MATH500, show that both methods improve dual-agent reasoning performance, which points to their potential for reinforcement learning in collaborative AI systems.
multi-agent, reinforcement-learning, credit-assignment
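
The counterfactual idea behind CCPO can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes a simple "difference reward" form, in which an agent's credit is the joint reward minus the reward obtained when that agent's output is swapped for a neutral baseline. The names `counterfactual_credits` and `toy_reward`, the agent labels, and the 0.75/0.25 reward split are all hypothetical and chosen only for illustration.

```python
from typing import Callable, Dict

Outputs = Dict[str, str]


def counterfactual_credits(
    outputs: Outputs,
    joint_reward: Callable[[Outputs], float],
    baseline: str = "",
) -> Dict[str, float]:
    """Credit per agent = joint reward minus the reward obtained when
    that agent's output is replaced by a neutral baseline.

    Assumption: a single baseline replacement stands in for the
    'counterfactual scenario'; the paper may define it differently.
    """
    full = joint_reward(outputs)
    credits: Dict[str, float] = {}
    for agent in outputs:
        counterfactual = dict(outputs)      # copy so the original is untouched
        counterfactual[agent] = baseline    # remove this agent's contribution
        credits[agent] = full - joint_reward(counterfactual)
    return credits


def toy_reward(outputs: Outputs) -> float:
    """Hypothetical reward: 0.75 for a correct answer from the solver,
    0.25 for a confirmation from the checker."""
    score = 0.0
    if "42" in outputs.get("solver", ""):
        score += 0.75
    if outputs.get("checker", "") == "confirmed":
        score += 0.25
    return score


outputs = {"solver": "The answer is 42", "checker": "confirmed"}
credits = counterfactual_credits(outputs, toy_reward)
print(credits)
```

In this toy setup the solver receives the larger credit because removing its output lowers the joint reward more. In a policy optimization loop, per-agent values like these would take the place of a single shared reward when each agent's policy is updated.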