Research · arXiv cs.AI · 15 d ago

Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation

The paper critiques pass@k as a way to estimate problem difficulty in math and science reasoning benchmarks. In experiments on eight free-form math datasets (including GSM8K and MATH) across four open-weight models, 10.3-22.9% of the problems that sampling fails to solve are solved by a six-chain deterministic decoding approach combined with activation grafting. The finding suggests that these failures reflect a blind spot in sampling rather than inherent difficulty, and that model evaluation and training on challenging reasoning tasks need alternative strategies.
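For readers unfamiliar with the metric, here is a minimal sketch of the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of sampled answers and c the number that are correct. The summary does not say which estimator or sample budget the paper uses, so the function name `pass_at_k` and the numbers in the usage lines are illustrative assumptions, not the authors' setup.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Estimate pass@k from n sampled answers, c of which are correct.

    Standard unbiased estimator: 1 - C(n - c, k) / C(n, k).
    Assumption: this is the common formulation, not necessarily the
    exact variant used in the paper summarized above.
    """
    if not (0 <= c <= n):
        raise ValueError("c must satisfy 0 <= c <= n")
    if not (1 <= k <= n):
        raise ValueError("k must satisfy 1 <= k <= n")
    # math.comb returns 0 when k > n - c, so the estimate is 1.0 in that
    # case: every size-k subset must contain at least one correct answer.
    return 1.0 - comb(n - c, k) / comb(n, k)


# Illustrative numbers only (not taken from the paper).
print(pass_at_k(n=64, c=0, k=64))  # no correct sample in the whole budget
print(pass_at_k(n=64, c=1, k=1))   # a single correct sample out of 64
```

When c is 0 the estimate is exactly 0 for every k, so the metric alone cannot distinguish a problem the model cannot solve from one that sampling simply never reached. That is the gap the paper's deterministic decoding experiments probe.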

math reasoning · difficulty estimation · sampling
