Research · arXiv cs.AI · 7 d ago

AFFORDANCE20Q: Evaluating Affordance Reasoning from Physical Properties

The article introduces Affordance20Q, a new benchmark for evaluating affordance reasoning in Large Language Models (LLMs). It is framed as a 20-Questions game that conceals object identities, so models cannot rely on memorization. The benchmark comprises 1,009 games covering 454 objects and 59 affordances, and it reveals a performance gap of approximately 20 points between LLMs and human participants. To improve model performance, the authors propose KB-Anchored Rule Induction (KARI), which generates affordance rules grounded in knowledge bases and improves open-source LLMs by up to 15.2 points, although limited KB coverage restricts further gains.

planning · Monte Carlo Tree Search · causal models
