Models · arXiv cs.AI · 14 d ago

QMFOL: Benchmarking Large Language Model Reasoning via Quantifiable Monadic First-Order Logic Test Case Generation

QMFOL is a newly proposed framework for generating monadic first-order logic reasoning tasks, designed to make the evaluation of deductive reasoning in large language models (LLMs) more rigorous. It first constructs formal structures, which gives precise control over logical complexity, then translates them into natural language and uses external provers to verify logical consistency. The accompanying benchmark, QMFOLBench, contains 2880 instances. Results on it show that model performance varies significantly with logical complexity and semantic diversity, which points to the need for more nuanced LLM evaluation metrics.
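The generate, translate, and verify pipeline described above can be illustrated with a minimal sketch. This is not the QMFOL implementation: the statement encoding, the `TEMPLATES` strings, the function names, and the example lexicon are all hypothetical, and a brute-force model enumeration stands in for the external provers mentioned in the summary. It relies on the standard fact that a monadic formula's truth depends only on which element "types" (combinations of predicates) occur in a model, so checking every non-empty set of types is enough.

```python
from itertools import combinations, product

# Hypothetical encoding: a statement is (kind, P, Q) over unary predicates.
#   ("all",  P, Q)  means  forall x. P(x) -> Q(x)
#   ("no",   P, Q)  means  forall x. P(x) -> not Q(x)
#   ("some", P, Q)  means  exists x. P(x) and Q(x)
PREDICATES = ["A", "B", "C"]

# Illustrative natural-language templates (assumed, not from the paper).
TEMPLATES = {
    "all": "Every {p} is a {q}.",
    "no": "No {p} is a {q}.",
    "some": "Some {p} is a {q}.",
}


def holds(stmt, model):
    """Evaluate a statement in a model given as a collection of element types.

    Each type is a frozenset of the predicates true of some element.
    """
    kind, p, q = stmt
    if kind == "all":
        return all(q in t for t in model if p in t)
    if kind == "no":
        return all(q not in t for t in model if p in t)
    if kind == "some":
        return any(p in t and q in t for t in model)
    raise ValueError(f"unknown statement kind: {kind}")


def all_models(preds):
    """Yield every non-empty set of element types over the given predicates."""
    types = [
        frozenset(p for p, bit in zip(preds, bits) if bit)
        for bits in product([False, True], repeat=len(preds))
    ]
    for size in range(1, len(types) + 1):
        yield from combinations(types, size)


def entails(premises, conclusion, preds=PREDICATES):
    """Brute-force check: the conclusion holds in every model of the premises."""
    return all(
        holds(conclusion, model)
        for model in all_models(preds)
        if all(holds(s, model) for s in premises)
    )


def render(stmt, lexicon):
    """Translate a formal statement into a natural-language sentence."""
    kind, p, q = stmt
    return TEMPLATES[kind].format(p=lexicon[p], q=lexicon[q])


if __name__ == "__main__":
    lexicon = {"A": "wumpus", "B": "mammal", "C": "animal"}  # made-up nouns
    premises = [("all", "A", "B"), ("all", "B", "C")]
    conclusion = ("all", "A", "C")
    for s in premises:
        print(render(s, lexicon))
    print("Therefore:", render(conclusion, lexicon))
    print("Valid:", entails(premises, conclusion))
```

With three predicates there are 8 element types and 255 candidate models, so exhaustive checking is cheap. A real generator would likely need a proper prover once formulas nest quantifiers or use many predicates.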

llm · benchmarking · reasoning
