Research
Hard or Just Unreached? Diagnosing the Sampling Blind Spot in Math-Reasoning Difficulty Estimation
The paper critiques the pass@k metric used in math and science reasoning benchmarks, arguing that it can misestimate the difficulty of problems a model fails to solve under sampling. In experiments on eight free-form math datasets (including GSM8K and MATH) across four open-weight models, 10.3% to 22.9% of the problems that sampling fails to solve can be solved by a six-chain deterministic decoding approach combined with activation grafting. This suggests that these failures reflect a blind spot in sampling rather than inherent difficulty, and that model evaluation and training on challenging reasoning tasks need strategies beyond sampling.
math reasoning, difficulty estimation, sampling
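
For readers unfamiliar with the metric the summary critiques, the following is a minimal sketch of the widely used unbiased pass@k estimator, 1 - C(n-c, k) / C(n, k), where n is the number of samples drawn and c the number that are correct. Whether the paper uses this exact estimator is an assumption, and the function name `pass_at_k` is illustrative only.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate from n samples, c of which are correct.

    Illustrative helper; assumes the standard estimator
    1 - C(n - c, k) / C(n, k), not necessarily the paper's exact setup.
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    return 1.0 - comb(n - c, k) / comb(n, k)


if __name__ == "__main__":
    # A problem with zero correct samples scores 0.0 for every k, so the
    # metric alone cannot separate "unsolvable" from "never reached by sampling".
    for k in (1, 8, 64):
        print(k, pass_at_k(n=64, c=0, k=k))
```

The case c = 0 is the one the summary is concerned with: the estimate is zero regardless of k, which is why a problem solved only by a different decoding procedure would still be counted as maximally hard under sampling.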