Recurrent Reasoning on Symbolic Puzzles with Sequence Models
The authors introduce RecurrReason, a benchmark for evaluating reasoning on symbolic puzzles. It contains 10,817 unique puzzles across four puzzle types, with a difficulty parameter ranging from 1 to 10. They benchmark two Transformer architectures, T5 (encoder-decoder) and GPT-2 (decoder-only). Fine-tuned T5 reaches 97.27% validation accuracy but fails to generalize to out-of-distribution tasks, scoring 0% on River Crossing. The results suggest that architectural choices matter more than model scale on these reasoning tasks, a relevant consideration for practitioners building robust AI systems.
reasoning, symbolic puzzles, sequence models