Agents · arXiv cs.AI · 7 d ago

Counterfactual Credit Policy Optimization for Multi-Agent Collaboration

The paper introduces two credit-assignment methods for multi-agent collaboration among large language models: Counterfactual Credit for Policy Optimization (CCPO) and Self-Evaluated Credit for Policy Optimization (SEPO). CCPO estimates each agent's contribution by contrasting the joint outcome with counterfactual scenarios, while SEPO uses self- and peer-evaluations to generate agent-specific rewards. Evaluations on mathematical reasoning benchmarks, including MATH500, show that both methods improve dual-agent reasoning performance, which suggests they could benefit reinforcement learning for collaborative AI systems more broadly.
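The counterfactual contrast described above can be illustrated with a minimal sketch. This is not the paper's formulation: the function names, the choice of "remove the agent's output" as the counterfactual, and the toy reward are all assumptions made for illustration. The idea shown is only that an agent's credit is the joint reward minus the reward obtained when that agent's contribution is taken away.

```python
from typing import Callable, Dict, Optional

# Hypothetical type: maps each agent's output (None = contribution removed)
# to a scalar reward for the joint outcome.
RewardFn = Callable[[Dict[str, Optional[str]]], float]


def counterfactual_credit(outputs: Dict[str, str], reward_fn: RewardFn) -> Dict[str, float]:
    """Credit per agent = joint reward minus reward with that agent's output removed."""
    joint = reward_fn(dict(outputs))
    credits: Dict[str, float] = {}
    for agent in outputs:
        counterfactual: Dict[str, Optional[str]] = dict(outputs)
        # Assumed counterfactual: drop this agent's contribution entirely.
        counterfactual[agent] = None
        credits[agent] = joint - reward_fn(counterfactual)
    return credits


def toy_reward(outputs: Dict[str, Optional[str]]) -> float:
    # Toy stand-in for a correctness check: 1.0 if any remaining agent
    # produced the right answer ("42"), otherwise 0.0.
    return 1.0 if any(o == "42" for o in outputs.values() if o is not None) else 0.0


# Two hypothetical agents; only "solver" produces the correct answer.
credits = counterfactual_credit({"solver": "42", "reviewer": "41"}, toy_reward)
print(credits)
```

In this toy setup the agent whose removal changes the outcome receives all the credit, while the other receives none. A real training setup would presumably feed such per-agent credits into a policy-gradient update, and that step is omitted here.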

multi-agent · reinforcement-learning · credit-assignment
