Research · arXiv cs.AI · 9 d ago

On the Limits of LLM-as-Judge for Scientific Novelty Assessment

The paper introduces RQ-Bench, a benchmark of research questions (RQs) derived from arXiv papers, aimed at studying how large language models (LLMs) assess the novelty of scientific ideas. The study finds that LLM judges tend to rate model-generated RQs as highly novel, whereas human experts prefer the author-anchored reference questions. This discrepancy raises concerns about the reliability of LLMs for scientific novelty evaluation and calls for caution when integrating them into research ideation.

LLM · scientific novelty · evaluation
