Research · arXiv cs.CL · 2 d ago

UnpredictaBench: A Benchmark for Evaluating Distributional Randomness in LLMs

UnpredictaBench is a benchmark that evaluates how well large language models (LLMs) capture a true underlying distribution rather than collapsing toward a single plausible answer. It comprises 448 problems and scores models with KS@N, a metric based on the Kolmogorov-Smirnov test that quantifies how closely a model's outputs approximate the target distribution. No tested model exceeds 40% accuracy at KS@100, which leaves significant room for improvement in distributional sampling. The results matter to practitioners who use LLMs to simulate complex systems, because they expose how hard it is to obtain reliable output diversity and distributional fidelity.
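To make the idea behind a Kolmogorov-Smirnov-based score concrete, here is a minimal sketch in Python. The summary does not give the exact definition of KS@N, so everything below is an assumption for illustration: the function names `ks_statistic` and `passes_ks` are hypothetical, the target is taken to be Uniform(0, 1), and a problem is treated as "passed" when the one-sample KS statistic over N samples stays below the standard asymptotic 5% critical value of roughly 1.358 / sqrt(N). The benchmark's actual rule may differ.

```python
import math
from typing import Callable, Sequence


def ks_statistic(samples: Sequence[float], cdf: Callable[[float], float]) -> float:
    """One-sample KS statistic: largest gap between empirical and target CDF."""
    xs = sorted(samples)
    n = len(xs)
    d = 0.0
    for i, x in enumerate(xs):
        f = cdf(x)
        # The empirical CDF jumps from i/n to (i+1)/n at x; check both sides.
        d = max(d, (i + 1) / n - f, f - i / n)
    return d


def passes_ks(samples: Sequence[float], cdf: Callable[[float], float]) -> bool:
    """Hypothetical pass rule (an assumption, not the benchmark's definition):
    statistic below the asymptotic 5% critical value 1.358 / sqrt(N)."""
    return ks_statistic(samples, cdf) < 1.358 / math.sqrt(len(samples))


def uniform_cdf(x: float) -> float:
    """CDF of Uniform(0, 1), used here as an assumed target distribution."""
    return min(1.0, max(0.0, x))


N = 100  # mirrors the N in KS@100

# Stand-in for a model whose N answers spread evenly over the target range.
spread = [(i + 0.5) / N for i in range(N)]

# Stand-in for a model that collapses to one plausible answer.
collapsed = [0.5] * N

spread_ok = passes_ks(spread, uniform_cdf)
collapsed_ok = passes_ks(collapsed, uniform_cdf)
```

Under these assumptions the collapsed sample fails: every answer is individually plausible, yet the empirical CDF sits 0.5 away from the target. That is the failure mode the benchmark is described as measuring.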

llm · benchmark · randomness · evaluation
