Inference
DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue
DiagFlowBench is a new dataset for evaluating how language models handle off-procedure inputs in grounded diagnostic dialogue. It comprises 50 industrial diagnostic flowcharts and 1,676 multi-turn conversations. An evaluation of ten commercial and open-weight models shows that abstention rates vary substantially across models, and that models tend to select contextually inappropriate steps rather than generate false information. The results point to a critical weakness in grounded systems and to the need for better handling of out-of-scope queries in practical applications.
language models, diagnostic dialogue, evaluation