Coding
DLawBench: Evaluating LLMs Through Multi-Turn Legal Consultation
DLawBench is a diagnostic benchmark for evaluating Large Language Models (LLMs) in multi-turn legal consultations, an interactive setting that existing legal benchmarks do not cover. It comprises 461 cases drawn from Chinese and U.S. law, with over 5,500 fact entries and 3,400 inquiry rubrics, and was used to assess 26 LLMs, among them GPT-5.5, which scored 0.562 in consultation-grounded legal reasoning. The results expose weaknesses in how LLMs handle legal consultations, including sycophancy and reduced effectiveness precisely when clients need the most guidance. These findings matter to practitioners building AI-driven legal assistance tools.
legal consultation, benchmark, large language models