Today’s top story is **Skill-RAG**, a framework that adds failure-awareness to Retrieval-Augmented Generation (RAG) to improve retrieval efficiency and accuracy on complex queries. Alongside it, the **On Cost-Effective LLM-as-a-Judge Improvement Techniques** paper presents scalable methods for making language model judges more accurate in reinforcement learning from human feedback (RLHF) frameworks, reaching 85.8% accuracy on RewardBench 2, a 13.5 percentage point gain over the baseline. Finally, **Who Wrote the Book? Detecting and Attributing LLM Ghostwriters** introduces a dataset and method for evaluating authorship attribution in long-form LLM-generated text, with the aim of making AI-generated content more transparent. Together, these papers reflect ongoing work on the robustness and reliability of AI systems in practical applications.
1.
Skill-RAG: Failure-State-Aware Retrieval Augmentation via Hidden-State Probing and Skill Routing
Skill-RAG is a failure-aware framework for Retrieval-Augmented Generation (RAG) that combines a hidden-state prober with a prompt-based skill router to address misalignment between queries and evidence. The prober diagnoses retrieval failures, and the router then selects one of four skills: query rewriting, question decomposition, evidence focusing, or an exit skill. The summary reports improved retrieval efficiency and accuracy, particularly on challenging open-domain QA and reasoning benchmarks. For practitioners, this offers a structured way to improve LLM performance in cases where traditional retrieval mechanisms fail, so that complex queries are handled more robustly.
arXiv cs.CL · 26 d ago · found 24 d ago · RAG
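The probe-then-route flow described for Skill-RAG can be sketched roughly as follows. This is an illustrative sketch only, not the authors' implementation: the linear probe, the `threshold` value, and the rule-based `toy_router` (a stand-in for what the paper describes as a prompt-based router) are all assumptions made here for demonstration. Only the four skill names come from the summary above.

```python
import math
from enum import Enum
from typing import Callable, Optional, Sequence


class Skill(Enum):
    # The four skills named in the summary.
    QUERY_REWRITING = "query rewriting"
    QUESTION_DECOMPOSITION = "question decomposition"
    EVIDENCE_FOCUSING = "evidence focusing"
    EXIT = "exit"


def probe_failure(hidden_state: Sequence[float],
                  weights: Sequence[float],
                  bias: float = 0.0) -> float:
    """Hypothetical linear probe: logistic score read as P(retrieval failed)."""
    z = sum(h * w for h, w in zip(hidden_state, weights)) + bias
    return 1.0 / (1.0 + math.exp(-z))


Router = Callable[[str, Sequence[str]], Skill]


def route(query: str,
          evidence: Sequence[str],
          hidden_state: Sequence[float],
          weights: Sequence[float],
          choose_skill: Router,
          threshold: float = 0.5) -> Optional[Skill]:
    """Return None when the probe sees no failure, else the router's skill."""
    if probe_failure(hidden_state, weights) < threshold:
        return None  # retrieval looks fine; answer with the current evidence
    return choose_skill(query, evidence)


def toy_router(query: str, evidence: Sequence[str]) -> Skill:
    """Placeholder rules standing in for a prompted LLM call (assumption)."""
    if not evidence:
        return Skill.EXIT
    if " and " in query:
        return Skill.QUESTION_DECOMPOSITION
    if len(evidence) > 3:
        return Skill.EVIDENCE_FOCUSING
    return Skill.QUERY_REWRITING
```

In a real system, `hidden_state` would come from the generator model and `choose_skill` would wrap an LLM prompt. Keeping the router as an injected callable makes it easy to swap the placeholder rules for such a call.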
2.
On Cost-Effective LLM-as-a-Judge Improvement Techniques
The paper presents four cost-effective techniques for improving the accuracy of language model judges in reinforcement learning from human feedback (RLHF) frameworks: ensemble scoring, task-specific criteria injection, calibration context, and adaptive model escalation. On RewardBench 2, ensemble scoring combined with criteria injection reaches 85.8% accuracy, a 13.5 percentage point improvement over the baseline (which implies a baseline of 72.3%), and small models benefit substantially from these methods. For practitioners, the work offers scalable ways to make LLM evaluation more reliable without substantial added cost, putting high-accuracy assessment within easier reach.
arXiv cs.CL · 26 d ago · found 24 d ago · Training
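To make the best-performing combination above more concrete, here is a minimal sketch of ensemble scoring with criteria injection. It reflects one plausible reading of those two terms (averaging several judge scores, and prepending task-specific criteria to the judge prompt) and is not taken from the paper. The `Judge` type, the prompt layout, and all function names are hypothetical.

```python
from statistics import mean
from typing import Callable, Sequence

# A judge maps a prompt to a numeric score; in practice this wraps an LLM call.
Judge = Callable[[str], float]


def inject_criteria(prompt: str, criteria: Sequence[str]) -> str:
    """Prepend task-specific criteria to the judge prompt (assumed layout)."""
    bullet_list = "\n".join(f"- {c}" for c in criteria)
    return f"Evaluation criteria:\n{bullet_list}\n\n{prompt}"


def ensemble_score(judge: Judge, prompt: str, n_samples: int = 3) -> float:
    """Average several judge calls to reduce the variance of a single score."""
    if n_samples < 1:
        raise ValueError("n_samples must be at least 1")
    return mean(judge(prompt) for _ in range(n_samples))


def pick_preferred(judge: Judge,
                   question: str,
                   responses: Sequence[str],
                   criteria: Sequence[str],
                   n_samples: int = 3) -> int:
    """Return the index of the response with the highest ensemble score."""
    scores = []
    for response in responses:
        prompt = inject_criteria(
            f"Question: {question}\nResponse: {response}", criteria
        )
        scores.append(ensemble_score(judge, prompt, n_samples))
    return max(range(len(scores)), key=scores.__getitem__)
```

Averaging only helps when the judge is stochastic (for example, sampled at a nonzero temperature); with a deterministic judge, `n_samples=1` gives the same result at lower cost.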
3.
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
The paper introduces HarDBench, a benchmark specifically designed to evaluate the vulnerability of large language models (LLMs) to draft-based co-authoring jailbreak attacks, where malicious users exploit incomplete drafts to elicit harmful outputs. It covers high-risk domains such as Explosives, Drugs, Weapons, and Cyberattacks, utilizing prompts with realistic structures to assess model susceptibility. The authors propose a safety-utility balanced alignment approach that significantly reduces harmful outputs while maintaining co-authoring performance, highlighting the need for robust evaluation frameworks in human-LLM collaborative writing.
arXiv cs.CL — 26 d ago · found 24 d agoSafety
The full briefing
Models & Releases
**Skill-RAG** brings failure-awareness to Retrieval-Augmented Generation (RAG), targeting retrieval efficiency and accuracy on complex queries. The framework is aimed at practitioners who need better LLM performance in cases where standard retrieval falls short. The **On Cost-Effective LLM-as-a-Judge Improvement Techniques** paper presents four techniques for improving the accuracy of language model judges in reinforcement learning from human feedback frameworks, with the best combination gaining 13.5 percentage points over the baseline. Both papers underscore the importance of refining how LLMs are evaluated so that they perform reliably in real-world applications.
Research & Safety
**Who Wrote the Book? Detecting and Attributing LLM Ghostwriters** introduces GhostWriteBench, a dataset for evaluating authorship attribution in long-form texts generated by LLMs, in support of transparency and accountability in AI-generated literature. The paper **Culturally uneven urban perception in large language models** highlights the risks of deploying LLMs in urban analysis and stresses that cultural context must be considered carefully to avoid bias. Both findings matter for practitioners who want to build fair AI systems that reflect diverse human perspectives.
Tooling & Open Source
**HarDBench** is a benchmark for evaluating how vulnerable LLMs are to draft-based co-authoring jailbreak attacks, and it underlines the need for robust evaluation frameworks in human-LLM collaborative writing. The work is particularly relevant for developers focused on the safety and utility of collaborative AI tools. The **GhazalBench** benchmark evaluates LLMs on their understanding of Persian ghazals and finds that models struggle with exact verse completion even when they capture poetic meaning. Both benchmarks are useful resources for practitioners working on LLM safety and on culturally nuanced applications.