Research · arXiv cs.AI · 10 d ago

MBABench: Evaluating LLM Agents on End-to-End Spreadsheet Tasks in Finance

MBABench is a new benchmark for evaluating LLM agents on end-to-end spreadsheet tasks in finance. It addresses a gap left by existing benchmarks, which focus on simpler tasks. The evaluation framework scores outputs against professional standards along three dimensions: Accuracy, Formula, and Format. Results indicate that the Claude family of models performs best at producing professional-looking spreadsheets, but even these models still struggle with complex workflows, which highlights the need for further advances in LLM capabilities for practical financial applications.

llm · agents · spreadsheet
