Inference · arXiv cs.AI · 11 d ago

DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue

DiagFlowBench is a new dataset for evaluating how language models handle off-procedure inputs in grounded diagnostic dialogue. It comprises 50 industrial diagnostic flowcharts and 1,676 multi-turn conversations. An evaluation of ten commercial and open-weight models finds that abstention rates vary widely from model to model, and that models tend to select steps that do not fit the context rather than generate false information. The results point to a weakness in grounded systems and to the need for better handling of out-of-scope queries in practical applications.

language models · diagnostic dialogue · evaluation
