Agents · arXiv cs.AI — 10 d ago

SorryDB: Can AI Provers Complete Real-World Lean Theorems?

SorryDB is a newly released, dynamically updating benchmark of open Lean tasks drawn from 78 real-world formalization projects on GitHub, designed to better align AI tools with community needs. An evaluation of several approaches, including generalist large language models and specialized symbolic provers, finds that the agentic approach using Gemini Flash performs best overall but is not strictly better than the other methods, which highlights how complementary the different AI techniques are in tackling complex formal mathematics tasks. For practitioners, the benchmark matters because it addresses test-set contamination and provides a more relevant metric for assessing AI contributions to formalization efforts.

