Agents
SIMMER: Benchmarking Latent Failures in LLM Executable Planning with a World Model
The article introduces SIMMER, a benchmark for latent failures in LLM-generated executable plans, set in a kitchen domain with a human-curated symbolic world model. The world model covers 77 actions, 262 objects, and roughly 46,800 interactions, and a state machine executor runs each plan against it to detect both immediate and latent failures. Even advanced LLMs produce error-free plans only 17% of the time, and up to 56% of their plans contain latent failures. Counterfactual foresight simulation can significantly reduce these failures, which points to the need for more robust LLM planning systems.
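To make the distinction between immediate and latent failures concrete, here is a minimal sketch of a state machine executor over a toy kitchen domain. Everything in it is an assumption for illustration: the four actions, the `HAZARDS` set, the `execute` function, and the working definitions (an immediate failure is a step whose preconditions do not hold when it runs; a latent failure is a plan whose steps all execute but which leaves a hazardous fact in the final state). It is not SIMMER's world model or executor, and the benchmark's own definitions may differ.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    """A symbolic action: preconditions plus facts it adds and removes."""
    name: str
    requires: frozenset = frozenset()
    adds: frozenset = frozenset()
    removes: frozenset = frozenset()


# Hypothetical toy domain; the real benchmark is far larger.
ACTIONS = {
    a.name: a
    for a in [
        Action("turn_on_stove", adds=frozenset({"stove_on"})),
        Action("place_pan", adds=frozenset({"pan_on_stove"})),
        Action(
            "fry_egg",
            requires=frozenset({"stove_on", "pan_on_stove"}),
            adds=frozenset({"egg_cooked"}),
        ),
        Action(
            "turn_off_stove",
            requires=frozenset({"stove_on"}),
            removes=frozenset({"stove_on"}),
        ),
    ]
}

# Assumed hazard list: facts that must not hold once the plan has finished.
HAZARDS = frozenset({"stove_on"})


def execute(plan, goal):
    """Run a plan step by step and classify the outcome."""
    state = set()
    for step, name in enumerate(plan):
        action = ACTIONS[name]
        missing = action.requires - state
        if missing:
            # The step cannot run at all: visible at execution time.
            return {"status": "immediate_failure", "step": step,
                    "detail": sorted(missing)}
        state -= action.removes
        state |= action.adds

    leftover = HAZARDS & state
    if leftover:
        # Every step ran, yet the final state is unsafe.
        return {"status": "latent_failure", "step": None,
                "detail": sorted(leftover)}
    unmet = set(goal) - state
    if unmet:
        return {"status": "goal_unmet", "step": None, "detail": sorted(unmet)}
    return {"status": "ok", "step": None, "detail": []}


if __name__ == "__main__":
    goal = {"egg_cooked"}
    print(execute(["fry_egg"], goal))
    print(execute(["turn_on_stove", "place_pan", "fry_egg"], goal))
    print(execute(["turn_on_stove", "place_pan", "fry_egg", "turn_off_stove"], goal))
```

In this sketch, the second plan reaches its goal and never fails at any individual step, but it leaves the stove on, which is the kind of problem a step-level check alone would miss. Running `execute` as a dry run on a candidate plan before committing to it is loosely analogous to simulating a plan ahead of execution, though the article's counterfactual foresight method is not described in enough detail here to reproduce.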
llm, planning, benchmark