Research
Contemporary AI lacks the imagination to diverge or negate in science
A recent study evaluates how well large language models (LLMs) generate scientific hypotheses, drawing on 25,139 ratings of LLM-generated ideas from 6,749 scientists across various disciplines. Its key findings are that non-reasoning LLMs produce a narrow range of similar ideas, whereas reasoning models explore a broader hypothesis space but fail to propose null hypotheses, which humans do more freely. The study also introduces a Qwen3-14B reward model that, after fine-tuning on human ratings, outperforms state-of-the-art models by up to 27% and aligns more closely with expert judgment, underscoring the need for human involvement in scientific AI applications.
evaluation, llm, science, creativity