ai-digest.dev
Inference · arXiv cs.AI · 11 d ago

DiagFlowBench: Evaluating How Language Models Handle Off-Procedure Inputs in Grounded Diagnostic Dialogue

DiagFlowBench is a new dataset for evaluating how language models handle off-procedure inputs in grounded diagnostic dialogue. It comprises 50 industrial diagnostic flowcharts and 1,676 multi-turn conversations. An evaluation of ten commercial and open-weight models finds that abstention rates vary widely from model to model, and that the models tend to select a step that does not fit the context rather than generate false information. The results point to a weakness in grounded systems and to the need for better handling of out-of-scope queries in practical applications.

language models · diagnostic dialogue · evaluation