Research
RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty
RankLLM is a framework for evaluating large language models (LLMs) that jointly quantifies question difficulty and model competency, addressing a limitation of existing benchmarks, which do not account for differences in question difficulty. It uses a bidirectional score propagation mechanism: models earn competency scores for answering questions correctly, and questions gain difficulty scores according to how models perform on them. Evaluated on 30 models and 35,550 questions, RankLLM reaches 90% agreement with human judgments, outperforms strong baselines such as Item Response Theory (IRT), and offers high stability and computational efficiency, which makes it useful for practitioners who need difficulty-aware LLM evaluation.
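To make the idea of bidirectional score propagation concrete, here is a minimal sketch. It is not RankLLM's actual update rule, which the summary does not specify. It assumes a HITS-style iteration in which a question gains difficulty from the competency of the models that answer it incorrectly, and the function names, the `damping` constant, and the normalization are all hypothetical choices made for illustration.

```python
def _normalize(values):
    """Scale non-negative scores so they sum to 1."""
    total = sum(values)
    return [v / total for v in values]


def propagate_scores(correct, iterations=50, damping=0.85):
    """Illustrative bidirectional propagation (assumed scheme, not the paper's).

    correct[m][q] is True when model m answered question q correctly.
    Returns (competency per model, difficulty per question), each summing to 1.
    """
    n_models = len(correct)
    n_questions = len(correct[0])
    competency = [1.0 / n_models] * n_models
    difficulty = [1.0 / n_questions] * n_questions

    for _ in range(iterations):
        # Models earn competency from the difficulty of questions they got right.
        # The (1 - damping) base term is an assumption that keeps scores positive.
        competency = _normalize([
            (1 - damping) / n_models
            + damping * sum(d for d, ok in zip(difficulty, row) if ok)
            for row in correct
        ])
        # Questions gain difficulty from the competency of models they defeated.
        difficulty = _normalize([
            (1 - damping) / n_questions
            + damping * sum(c for c, row in zip(competency, correct) if not row[q])
            for q in range(n_questions)
        ])
    return competency, difficulty


# Toy data: model 0 answers everything, model 2 only the first question.
toy_results = [
    [True, True, True],
    [True, True, False],
    [True, False, False],
]
toy_competency, toy_difficulty = propagate_scores(toy_results)
print("competency:", toy_competency)
print("difficulty:", toy_difficulty)
```

On this toy input the model that answers every question ranks highest, and the question that only one model answers ranks as the most difficult. A question missed by a strong model counts for more than one missed by a weak model, which is the kind of weighting that plain accuracy does not capture.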
llm, ranking, evaluation