Agents
Let LLMs Judge Each Other: Multi-Agent Peer-Reviewed Reasoning for Medical Question Answering
The article presents a multi-agent, peer-reviewed reasoning method for improving the performance of large language models (LLMs) on medical question answering (MedQA). Multiple LLMs generate reasoning chains and evaluate one another's chains. Experiments used five models (Llama-3.1-8B, Qwen2.5-7B, Phi-4, DeepSeek-LLM-7B, GPT-oss-20B) on three benchmark datasets. The method reached an average accuracy of 0.820, outperforming both single-model reasoning and majority voting, which the authors present as a gain in accuracy, interpretability, and robustness for biomedical AI applications.
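To make the contrast with majority voting concrete, here is a minimal Python sketch of one way a peer-review selection step could work. It is not the authors' implementation: the `Candidate` type, the `Reviewer` signature, the scoring range, and the rule of picking the chain with the highest mean peer score are all assumptions made for illustration, and the reviewers are toy stubs standing in for real LLM calls.

```python
from collections import Counter
from dataclasses import dataclass
from statistics import mean
from typing import Callable, Dict, List


@dataclass
class Candidate:
    author: str     # name of the model that produced the chain (hypothetical)
    answer: str     # final option label, e.g. "B"
    reasoning: str  # free-text reasoning chain


# Assumed signature: a reviewer maps (question, candidate) to a score in [0, 1].
Reviewer = Callable[[str, Candidate], float]


def majority_vote(candidates: List[Candidate]) -> str:
    """Baseline: return the most frequent answer (ties go to the first seen)."""
    return Counter(c.answer for c in candidates).most_common(1)[0][0]


def peer_reviewed_answer(
    question: str,
    candidates: List[Candidate],
    reviewers: Dict[str, Reviewer],
) -> str:
    """Return the answer whose chain gets the highest mean score from the
    other models. A model never scores its own chain."""

    def peer_score(c: Candidate) -> float:
        scores = [
            review(question, c)
            for name, review in reviewers.items()
            if name != c.author
        ]
        return mean(scores) if scores else 0.0

    return max(candidates, key=peer_score).answer


def stub_reviewer(question: str, candidate: Candidate) -> float:
    """Toy stand-in for an LLM judge: rewards chains that give a justification."""
    return 1.0 if "because" in candidate.reasoning else 0.2


# Placeholder question and chains, invented purely for the demo.
question = "Which option is correct? (toy placeholder)"
candidates = [
    Candidate("model_1", "A", "Option A sounds familiar."),
    Candidate("model_2", "A", "Probably A."),
    Candidate("model_3", "B", "B, because the stated finding rules out A."),
]
reviewers = {c.author: stub_reviewer for c in candidates}

print(majority_vote(candidates))                            # A
print(peer_reviewed_answer(question, candidates, reviewers))  # B
```

In this toy case two weakly justified chains outvote the one justified chain, while the peer-scored selection picks the justified one. Whether real LLM judges behave this way is an empirical question that the sketch does not address.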
medical, question answering, peer review