Research
Evalatro: an open benchmark where LLMs play the real Balatro
Evalatro is an open benchmark in which LLMs play Balatro autonomously, with no tactical hints. Runs use fixed seeds for reproducibility, results are posted to a public leaderboard, and scoring is handled by the server to prevent cheating. For AI practitioners, it offers a structured environment for assessing model performance in a game, exposing both the capabilities and the limitations of LLM decision-making.
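To make the two anti-cheating and reproducibility ideas concrete, here is a minimal toy sketch in Python. It is not Evalatro's code or API: the class `ToyRunServer`, the function `run_episode`, the action names `play` and `discard`, and the card-value scoring rule are all hypothetical, chosen only to illustrate how a fixed seed makes a run repeatable and how keeping the score on the server means a client can submit actions but never a score.

```python
import random


class ToyRunServer:
    """Hypothetical stand-in for a benchmark server (not Evalatro's real API)."""

    def __init__(self, seed: int) -> None:
        # A fixed seed makes the sequence of draws identical on every run.
        self._rng = random.Random(seed)
        # The score lives on the server; clients have no method to set it.
        self.score = 0

    def step(self, action: str) -> int:
        """Apply one client action and return the value of the card drawn."""
        if action not in ("play", "discard"):
            raise ValueError(f"unknown action: {action}")
        card = self._rng.randint(1, 13)
        if action == "play":
            # Only the server decides how an action changes the score.
            self.score += card
        return card


def run_episode(seed: int, actions: list[str]) -> int:
    """Replay a list of actions against a fresh server and return its score."""
    server = ToyRunServer(seed)
    for action in actions:
        server.step(action)
    return server.score
```

Under these assumptions, replaying the same actions with the same seed always yields the same score, which is what makes results from different models comparable, and an attempt to send anything other than a known action is rejected instead of being trusted.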
benchmark, llm, evalatro