Safety · arXiv cs.AI · 47 d ago

PhantomBench: Benchmarking the Non-existential Threat of Language Models

PhantomBench, a benchmark introduced in arXiv:2606.11105v1, measures how often 21 language models hallucinate when asked about more than 60,000 non-existent terms and entities drawn from diverse domains. The reported hallucination rates are high, with averages reaching 86.7%, which indicates that even advanced models often fail to recognize that a concept does not exist. Beyond assessing how models handle rare concepts, the benchmark provides a scalable pipeline for generating tailored non-existent concepts, which practitioners can use when working to mitigate hallucination risk.
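To make the headline metric concrete, here is a minimal sketch of how a hallucination rate over non-existent terms could be computed and averaged across models. It is not the paper's method: the summary does not describe how responses are judged, so the keyword list `ABSTAIN_MARKERS` and the helper names `is_hallucination`, `hallucination_rate`, and `average_rate` are hypothetical stand-ins chosen only for illustration.

```python
from statistics import mean

# Hypothetical abstention phrases (an assumption for this sketch).
# A real benchmark would likely use a stronger judge than keyword matching.
ABSTAIN_MARKERS = (
    "not aware of",
    "does not exist",
    "no record of",
    "not familiar with",
)


def is_hallucination(response: str) -> bool:
    """Count a response as a hallucination unless it flags the term as unknown."""
    text = response.lower()
    return not any(marker in text for marker in ABSTAIN_MARKERS)


def hallucination_rate(responses: list[str]) -> float:
    """Fraction of responses to non-existent-term prompts that are hallucinations."""
    if not responses:
        raise ValueError("need at least one response")
    return sum(is_hallucination(r) for r in responses) / len(responses)


def average_rate(per_model: dict[str, list[str]]) -> float:
    """Unweighted mean of the per-model hallucination rates."""
    return mean(hallucination_rate(r) for r in per_model.values())
```

Under these assumptions, a model that confidently defines every invented term scores 1.0 and a model that always says it is not aware of the term scores 0.0. The average here is unweighted across models, which is only one of several reasonable ways to aggregate.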

hallucinations · language models · benchmarking · relevance 0.60 · engagement 0.00
