Research
MÖVE: A Holistic LLM Benchmark for the German Public Sector
MÖVE (Modelle für die Öffentliche Verwaltung Evaluieren, roughly "evaluating models for public administration") is a new benchmark for assessing large language models (LLMs), tailored to the German public sector. It evaluates 39 models against both performance and governance criteria, using ten German-language datasets and a multi-metric evaluation strategy. The results show that no single model excels across all tasks and that model size is not a reliable indicator of quality. For practitioners, the benchmark offers a comprehensive framework for model selection in public administration and addresses gaps in the existing evaluation landscape.
benchmark public-sector llm