Research
TW-LegalBench: Measuring Taiwanese Legal Understanding
The article introduces TW-LegalBench, a benchmark for evaluating large language models (LLMs) on Taiwanese legal understanding. Its dataset comprises over 16,000 multiple-choice questions, 117 open-ended essay questions, and more than 14,000 legal judgment prediction instances. An evaluation of 13 LLMs shows that while some models meet the passing threshold for legal professionals, they still fall short of the threshold for judges and prosecutors, particularly in citing legal articles accurately. For practitioners, the benchmark underscores the limitations of LLMs in jurisdiction-specific legal reasoning and the need for models that generate more reliable legal text.
llm, legal, benchmark