Research
CombEval: A Framework for Evaluating Combinatorial Counting in Large Language Models
CombEval is a new dynamic benchmark for evaluating combinatorial counting in large language models (LLMs). It uses a typed Cofola specification to generate natural-language counting problems systematically, with controlled variation in object types and constraints. An evaluation of 11 LLMs reveals persistent weaknesses on ordered objects and nested dependencies, which makes CombEval useful for diagnosing where LLM combinatorial reasoning breaks down.
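To make the idea of spec-driven generation concrete, here is a minimal Python sketch. It is not CombEval's code and does not use real Cofola syntax: the `CountingSpec` class, its fields, and the `render` and `brute_force_count` helpers are all hypothetical names chosen for illustration. It only shows how a typed specification might be rendered into a natural-language problem while a reference answer is computed by enumeration, with the `ordered` flag toggling the object type.

```python
from dataclasses import dataclass
from itertools import combinations, permutations


@dataclass(frozen=True)
class CountingSpec:
    """Hypothetical typed spec (an assumption, not the real Cofola format)."""
    n_items: int                      # size of the ground set
    k: int                            # number of items selected
    ordered: bool                     # True: arrangements, False: subsets
    must_include_first: bool = False  # one simple illustrative constraint


def render(spec: CountingSpec) -> str:
    """Turn the spec into a natural-language counting problem."""
    if spec.ordered:
        action = f"line up {spec.k} of them in a row"
    else:
        action = f"choose {spec.k} of them"
    text = (f"There are {spec.n_items} distinct books. "
            f"In how many ways can you {action}")
    if spec.must_include_first:
        text += " if the first book must be among them"
    return text + "?"


def brute_force_count(spec: CountingSpec) -> int:
    """Reference answer by exhaustive enumeration (practical for small n only)."""
    items = range(spec.n_items)
    if spec.ordered:
        candidates = permutations(items, spec.k)
    else:
        candidates = combinations(items, spec.k)
    # Item 0 stands for "the first book" in the rendered text.
    return sum(1 for c in candidates
               if not spec.must_include_first or 0 in c)


if __name__ == "__main__":
    # Vary object type and constraint independently, as a controlled sweep.
    for ordered in (False, True):
        for constrained in (False, True):
            spec = CountingSpec(5, 2, ordered, constrained)
            print(render(spec), "->", brute_force_count(spec))
```

Holding `n_items` and `k` fixed while flipping only `ordered` or only the constraint yields problem pairs that differ in exactly one factor, which is the kind of controlled variation the summary describes.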
benchmarking, combinatorial, llm