Daily digest — 2026-06-19

RankLLM: Weighted Ranking of LLMs by Quantifying Question Difficulty

RankLLM is a newly proposed framework for evaluating large language models (LLMs) that quantifies both question difficulty and model competency, addressing a limitation of existing benchmarks, which do not account for differences in question difficulty. It uses a bidirectional score propagation mechanism: a model earns competency for each question it answers correctly, and a question's difficulty score rises when it proves hard for the models. Evaluated on 30 models and 35,550 questions, RankLLM reaches 90% agreement with human judgments, outperforms strong baselines such as Item Response Theory (IRT), and is reported to be stable and computationally efficient, which makes it useful for practitioners who need difficulty-aware LLM evaluation.

arXiv cs.AI · 10 d ago · Research
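
The bidirectional score propagation in the RankLLM summary can be illustrated with a small sketch. This is not the paper's algorithm: the update rules, the smoothing constant `eps`, the normalization, and the helper name `propagate_scores` are all assumptions chosen for illustration. It assumes a question gains difficulty from the models that fail it (weighted by their competency) and a model gains competency from the difficulty of the questions it solves.

```python
def propagate_scores(correct, iterations=50, eps=1e-3):
    """Hypothetical helper: correct[m][q] is True when model m solved question q."""
    n_models, n_questions = len(correct), len(correct[0])
    competency = [1.0 / n_models] * n_models
    difficulty = [1.0 / n_questions] * n_questions
    for _ in range(iterations):
        # Assumed rule: a question gets harder when models fail it,
        # and failures by competent models count for more.
        difficulty = [
            eps + sum(competency[m] for m in range(n_models) if not correct[m][q])
            for q in range(n_questions)
        ]
        total = sum(difficulty)
        difficulty = [d / total for d in difficulty]
        # Assumed rule: a model earns the difficulty of every question it solved.
        competency = [
            eps + sum(difficulty[q] for q in range(n_questions) if correct[m][q])
            for m in range(n_models)
        ]
        total = sum(competency)
        competency = [c / total for c in competency]
    return competency, difficulty


# Toy answer matrix (made up): rows are models, columns are questions.
correct = [
    [True, True, True, False],   # model 0 solves q0, q1, q2
    [True, True, False, False],  # model 1 solves q0, q1
    [True, False, False, True],  # model 2 solves q0, q3
]
competency, difficulty = propagate_scores(correct)
ranking = sorted(range(len(correct)), key=lambda m: competency[m], reverse=True)
print(ranking)
```

In this toy matrix, models 1 and 2 each solve two questions, so plain accuracy ties them. The weighted scores rank model 2 higher because the question only it solved is one that both other models failed.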

ASA: Backbone-Training-Free Representation Engineering for Tool-Calling Agents

The paper introduces the Activation Steering Adapter (ASA), a training-free method that improves tool calling in LLM agents while leaving the backbone model untouched. ASA combines a router-conditioned mixture of steering vectors with a probe-guided signed gate. On the MTU-Bench benchmark with the Qwen2.5-1.5B model, it raises strict tool-use F1 from 0.18 to 0.50 while significantly reducing false positives. For practitioners, this offers a lightweight way to narrow the representation-behavior gap in LLMs and make domain-specific tool integration more reliable.

arXiv cs.AI · 10 d ago · Agents
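
To make the two ASA components named in the summary concrete, here is a minimal pure-Python sketch of one way a router-conditioned mixture of steering vectors and a signed gate could combine. The summary does not give the paper's formulation, so everything here is an assumption: the softmax router, the linear probe, the three-valued gate, and the names `steer`, `router_weights`, `probe_weights`, `strength`, and `margin` are hypothetical.

```python
import math


def softmax(logits):
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def steer(hidden, steering_vectors, router_weights, probe_weights,
          strength=1.0, margin=0.0):
    """Hypothetical steering step applied to one hidden-state vector."""
    # Router (assumed linear): one logit per steering vector, from the hidden state.
    mix = softmax([dot(w, hidden) for w in router_weights])
    # Mixture of steering vectors, weighted by the router output.
    direction = [
        sum(m * v[i] for m, v in zip(mix, steering_vectors))
        for i in range(len(hidden))
    ]
    # Probe (assumed linear): positive score means "a tool call is appropriate".
    score = dot(probe_weights, hidden)
    # Signed gate: push toward tool use, away from it, or leave unchanged.
    gate = 1.0 if score > margin else -1.0 if score < -margin else 0.0
    return [h + strength * gate * d for h, d in zip(hidden, direction)]


# Toy values (made up) in a 2-dimensional activation space.
hidden = [1.0, 0.0]
steering_vectors = [[0.0, 1.0], [0.0, 2.0]]
router_weights = [[0.0, 0.0], [0.0, 0.0]]  # uniform mixture for this input

steered_on = steer(hidden, steering_vectors, router_weights, probe_weights=[1.0, 0.0])
steered_off = steer(hidden, steering_vectors, router_weights, probe_weights=[-1.0, 0.0])
print(steered_on, steered_off)
```

The signed gate is what lets a single mechanism both encourage tool calls when the probe fires and suppress them when it does not, which is one plausible reading of how false positives could be reduced.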

Fact-Augmented Lookahead Planning for LLM Agents

The paper introduces LWM-Planner, a fact-augmented lookahead planning framework that uses in-context learning to improve LLM agents in complex, partially observable environments. After each episode, the framework extracts and validates task-critical facts; it then runs recursive, depth-limited lookahead planning in which action proposals and state-value estimates are conditioned on those facts. This yields higher cumulative returns than existing methods such as ReAct and Reflexion on benchmarks including FrozenLake, CrafterMini, and ALFWorld. For practitioners, the result shows that LLM agent performance can improve without parameter updates by using experience-derived knowledge for planning.

arXiv cs.AI · 10 d ago · Agents
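
The recursive, depth-limited lookahead described in the LWM-Planner summary can be sketched on a toy grid. This is an illustration under stated assumptions, not the paper's implementation: in the paper the proposer, simulator, and value estimator are LLM calls, whereas here `propose`, `simulate`, and `estimate_value` are plain hypothetical functions, and "facts" are simplified to a set of grid cells known to be holes.

```python
ROWS, COLS = 2, 3
GOAL = (0, 2)
MOVES = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def propose(state, facts):
    """Stand-in for the action proposer: every move that stays on the grid."""
    row, col = state
    return [name for name, (dr, dc) in MOVES.items()
            if 0 <= row + dr < ROWS and 0 <= col + dc < COLS]


def simulate(state, action, facts):
    """Stand-in world model: it only knows about holes listed in the facts."""
    dr, dc = MOVES[action]
    nxt = (state[0] + dr, state[1] + dc)
    if nxt in facts:
        return nxt, -1.0, True   # a known hole ends the episode
    if nxt == GOAL:
        return nxt, 1.0, True
    return nxt, 0.0, False


def estimate_value(state, facts):
    """Stand-in value estimate: closer to the goal is better."""
    return -0.1 * (abs(state[0] - GOAL[0]) + abs(state[1] - GOAL[1]))


def lookahead(state, depth, facts, discount=0.95):
    """Return (value, best_action) from a recursive, depth-limited search."""
    if depth == 0:
        return estimate_value(state, facts), None
    best_value, best_action = float("-inf"), None
    for action in propose(state, facts):
        nxt, reward, done = simulate(state, action, facts)
        future = 0.0 if done else lookahead(nxt, depth - 1, facts, discount)[0]
        total = reward + discount * future
        if total > best_value:
            best_value, best_action = total, action
    if best_action is None:  # no action was proposed
        return estimate_value(state, facts), None
    return best_value, best_action


start = (0, 0)
plan_without_fact = lookahead(start, depth=2, facts=set())[1]
plan_with_fact = lookahead(start, depth=2, facts={(0, 1)})[1]
print(plan_without_fact, plan_with_fact)
```

Without the fact, the planner walks straight toward the goal through the cell at (0, 1). Once a fact marks that cell as a hole, the same search detours downward, with no change to the planner itself.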
