Research · arXiv cs.AI · 7 d ago

SciR: A Controllable Benchmark for Scientific Reasoning in LLMs

SciR is a benchmark for evaluating large language models (LLMs) on three forms of scientific inference: deduction, induction, and causal abduction. Its tasks are generated from formal objects and then rendered as multi-document scientific discourse, a construction that lets both extraction difficulty and inference difficulty be varied in a controlled way. For practitioners, it offers a systematic way to assess LLM performance in scientific contexts and to see where individual models succeed or fall short on complex reasoning tasks.

benchmark · scientific reasoning · llm
