Models · arXiv cs.AI · 21 h ago

T1-Bench: Benchmarking Multi-Scenario Agents in Real-World Domains

T1-Bench is a new benchmark for evaluating agentic systems in complex, multi-domain environments. It targets two limitations of existing benchmarks: task complexity and realism. The benchmark spans 25 domains and features interleaved scenarios that require structured reasoning and multi-turn interaction. Twelve models, both proprietary and open-weight, are evaluated on it. T1-Bench is intended to support closer evaluation of agent behavior and tool use, and it will be released as open source to give researchers and practitioners a standardized framework.

benchmarking · agents · multi-domain
