Research · arXiv cs.AI · 7 d ago

How reliable are LLMs when it comes to playing dice?

This study evaluates the probabilistic reasoning abilities of eight state-of-the-art large language models (LLMs) on two datasets of discrete probability problems, one standard and one counterintuitive. The models achieved an average accuracy of 0.96 on the standard problems but only 0.59 on the counterintuitive ones, a drop the study attributes to token bias and misleading prompt variations. The findings highlight the limits of current LLMs in genuine probabilistic reasoning, a capability that matters to practitioners building applications that depend on robust decision-making under uncertainty.
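To make the standard versus counterintuitive distinction concrete, here is a minimal sketch that computes exact answers to two dice problems by enumerating all outcomes. Both problems and the helper name `probability` are illustrative assumptions; they are not taken from the study's datasets or evaluation code.

```python
from fractions import Fraction
from itertools import product


def probability(event, given=lambda outcome: True, dice=2, sides=6):
    """Exact P(event | given) over all equally likely rolls of fair dice.

    Hypothetical helper for illustration; not part of the study.
    """
    # Keep only the rolls that satisfy the conditioning predicate.
    outcomes = [o for o in product(range(1, sides + 1), repeat=dice) if given(o)]
    favorable = [o for o in outcomes if event(o)]
    return Fraction(len(favorable), len(outcomes))


# Standard-style problem: two fair dice sum to 7.
p_sum_seven = probability(lambda o: sum(o) == 7)

# Counterintuitive-style problem: given that at least one die shows a 6,
# the probability that both do. The tempting answer is 1/6.
p_double_six = probability(
    event=lambda o: o == (6, 6),
    given=lambda o: 6 in o,
)

print(p_sum_seven)   # 1/6
print(p_double_six)  # 1/11
```

The second answer is 1/11 rather than 1/6 because conditioning on "at least one 6" leaves 11 equally likely rolls, only one of which is a double six. Problems of this kind, where the intuitive answer differs from the exact one, are the sort on which a model's stated answer could be checked against an enumerated ground truth.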

probabilistic reasoning · llm · benchmarking
