Research
SciR: A Controllable Benchmark for Scientific Reasoning in LLMs
SciR is a new benchmark for evaluating large language models (LLMs) on scientific reasoning across three forms of inference: deduction, induction, and causal abduction. Its tasks are generated from formal objects and then rendered into multi-document scientific discourse, so extraction difficulty and inference difficulty can each be varied in a controlled way. For practitioners, this offers a systematic way to assess and improve LLM performance in scientific contexts, and to see where individual models succeed or fail on complex reasoning tasks.
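To make the idea of controllable difficulty concrete, here is a minimal Python sketch of how a deduction task might be generated from a formal object and spread across several documents. It is an illustration only, not SciR's actual generator: the `Task` class, the `make_deduction_task` function, its parameters, and the sentence templates are all hypothetical. The assumption is that inference difficulty can be modeled as the length of an implication chain, and extraction difficulty as the number of irrelevant sentences and documents a model has to sift through.

```python
import random
from dataclasses import dataclass


@dataclass
class Task:
    """Hypothetical container for one generated task (not SciR's schema)."""
    documents: list[str]  # rendered multi-document text
    question: str
    answer: bool


def make_deduction_task(chain_length: int, n_distractors: int,
                        n_docs: int, seed: int = 0) -> Task:
    """Build a toy deduction task from a formal implication chain.

    chain_length  -> inference difficulty (number of chained implications)
    n_distractors -> extraction difficulty (irrelevant sentences to ignore)
    n_docs        -> number of documents the sentences are scattered across
    """
    rng = random.Random(seed)

    # Formal object: P0 holds, and P0 -> P1 -> ... -> P{chain_length}.
    props = [f"P{i}" for i in range(chain_length + 1)]
    facts = [f"Observation: {props[0]} holds in the sample."]
    facts += [f"Whenever {a} holds, {b} follows."
              for a, b in zip(props, props[1:])]

    # Distractors mention no proposition, so they never affect the answer.
    noise = [f"Unrelated note {i}: instrument D{i} was recalibrated."
             for i in range(n_distractors)]

    sentences = facts + noise
    rng.shuffle(sentences)

    # Scatter the shuffled sentences round-robin across the documents.
    docs = [" ".join(sentences[i::n_docs]) for i in range(n_docs)]
    return Task(docs, f"Does {props[-1]} hold?", True)


if __name__ == "__main__":
    task = make_deduction_task(chain_length=3, n_distractors=2, n_docs=2, seed=1)
    for i, doc in enumerate(task.documents):
        print(f"Document {i}: {doc}")
    print(task.question)
```

Keeping the two knobs separate is the point of the sketch: raising `chain_length` makes the reasoning longer without adding text to filter, while raising `n_distractors` or `n_docs` makes the relevant premises harder to find without changing the logic.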
benchmark, scientific reasoning, llm