Research · arXiv cs.AI · 7 d ago

MA-ProofBench: A Two-Tiered Evaluation of LLMs for Theorem Proving in Mathematical Analysis

MA-ProofBench is a new benchmark for evaluating large language models (LLMs) on theorem proving in mathematical analysis. It comprises 200 formalized theorems spanning a range of topics, split into two tiers: undergraduate level and Ph.D. qualifying level. In initial evaluations, even the top-performing model, GPT-5.5, succeeds on only 16% of the easier problems, which points to substantial gaps in formal reasoning and the need for stronger models in complex mathematical domains.

llm · theorem proving · mathematical analysis · benchmark
