Research
Too long; didn't solve
This paper investigates how two structural length variables, prompt length and solution length, affect the performance of large language models (LLMs) on a new adversarial dataset of expert-authored mathematics problems. The study finds that both lengths correlate positively with model failure rates, suggesting that longer prompts and longer solutions may make reasoning tasks harder. For practitioners, the result highlights the need to account for structural properties when designing benchmarks and when evaluating model performance on mathematical reasoning tasks.
mathematics, llm, benchmark
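As an illustration of the kind of analysis summarized above, the following is a minimal sketch of how a correlation between problem length and model failure could be measured. It is not the paper's method: the record layout, the token counts, and the failure labels are hypothetical toy values, and the statistic used here (Pearson correlation against a 0/1 failure indicator, which is equivalent to a point-biserial correlation) is an assumption chosen for simplicity.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


# Hypothetical toy records, NOT data from the paper:
# (prompt_length_tokens, solution_length_tokens, model_failed)
records = [
    (40, 120, 0),
    (55, 150, 0),
    (90, 300, 1),
    (120, 280, 0),
    (150, 500, 1),
    (200, 650, 1),
]

prompt_lengths = [r[0] for r in records]
solution_lengths = [r[1] for r in records]
failed = [r[2] for r in records]

# With a 0/1 outcome, Pearson's r is the point-biserial correlation.
r_prompt = pearson(prompt_lengths, failed)
r_solution = pearson(solution_lengths, failed)

print(f"prompt length vs. failure:   r = {r_prompt:.3f}")
print(f"solution length vs. failure: r = {r_solution:.3f}")
```

A positive coefficient for either variable would match the direction of the reported finding. A correlation of this kind does not separate the effect of length from other properties of a problem, such as topic or difficulty.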