Agents
TMax: A Simple Recipe for Terminal Agents
TMax is a recipe for training terminal agents with reinforcement learning (RL). It has two parts: the TMax-15k dataset of 14,600 environments, which is over 2.5× larger than existing datasets, and a simplified outcome-only RL recipe that combines GRPO with stability enhancements. TMax-9B scores 27.2% on Terminal Bench 2.0, outperforming previous 32B models and approaching the performance of closed models, while TMax-27B improves this to 42.7%, rivaling much larger models. Together, the dataset and recipe give practitioners a practical starting point for training competitive terminal agents.
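To make "outcome-only GRPO" concrete, here is a minimal sketch of the group-relative advantage step as it is commonly described for GRPO. It is not the TMax implementation: the function name, the `eps` value, and the binary pass/fail rewards are assumptions for illustration, and the summary does not specify the stability enhancements, so none are shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize outcome rewards within one group of rollouts.

    Each rollout's advantage is its reward minus the group mean,
    divided by the group standard deviation (plus eps to avoid
    division by zero). Hypothetical helper, not from the TMax code.
    """
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Assumed setup: 4 rollouts of the same terminal task, scored only on
# the final outcome (1.0 = task checks passed, 0.0 = failed).
rewards = [1.0, 0.0, 0.0, 1.0]
advantages = group_relative_advantages(rewards)

# Successful rollouts get positive advantages, failed ones negative.
for r, a in zip(rewards, advantages):
    print(f"reward={r:.1f} advantage={a:+.3f}")
```

One consequence of this normalization is that a group in which every rollout receives the same reward yields all-zero advantages, so that task contributes no gradient signal for that step.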
TMax, reinforcement_learning, agents