Training
WorkflowPerturb: Calibrated Stress Tests for Evaluating Multi-Agent Workflow Metrics
WorkflowPerturb is a benchmark for stress-testing the metrics used to evaluate structured workflows generated by multi-agent LLM systems, motivated by the challenge of change management in production environments. It contains 4,973 golden workflows and 44,757 perturbed variants spanning three perturbation types (Missing Steps, Compressed Steps, and Description Changes), each at severity levels of 10%, 30%, and 50%. Practitioners can use it to assess how sensitive and well calibrated a workflow evaluation metric is, which supports safer deployment decisions when the underlying models or orchestration code are updated.
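To make the perturbation setup concrete, here is a minimal Python sketch of a "Missing Steps" perturbation applied at the three severity levels. It is an illustration under stated assumptions, not the benchmark's actual generator: the representation of a workflow as a list of step strings, the reading of severity as the fraction of steps affected, and the names `n_affected`, `missing_steps`, and `variants_per_workflow` are all hypothetical. The reported counts are consistent with one variant per (perturbation type, severity) pair, since 4,973 × 3 × 3 = 44,757.

```python
import random

# Severity levels named in the summary, read here (as an assumption)
# as the fraction of steps that a perturbation touches.
SEVERITIES = (0.10, 0.30, 0.50)


def n_affected(num_steps, severity):
    """Number of steps to perturb: at least one, never all of them."""
    if num_steps < 2:
        raise ValueError("need at least two steps to perturb a workflow")
    return max(1, min(num_steps - 1, round(num_steps * severity)))


def missing_steps(workflow, severity, rng):
    """Hypothetical 'Missing Steps' perturbation.

    Drops a random subset of steps and keeps the survivors in their
    original order.
    """
    k = n_affected(len(workflow), severity)
    dropped = set(rng.sample(range(len(workflow)), k))
    return [step for i, step in enumerate(workflow) if i not in dropped]


def variants_per_workflow(num_types=3, num_severities=3):
    """Variants per golden workflow if every type is applied at every severity."""
    return num_types * num_severities


if __name__ == "__main__":
    rng = random.Random(0)  # fixed seed so the sketch is reproducible
    golden = [f"step {i}" for i in range(1, 11)]  # toy 10-step workflow
    for severity in SEVERITIES:
        perturbed = missing_steps(golden, severity, rng)
        print(f"{severity:.0%}: {len(golden)} -> {len(perturbed)} steps")
    print(4973 * variants_per_workflow())
```

A well-calibrated metric, in this framing, would score the 50% variant as further from the golden workflow than the 30% variant, and the 30% variant as further than the 10% variant.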
workflow, evaluation, multi-agent