Safety
HarDBench: A Benchmark for Draft-Based Co-Authoring Jailbreak Attacks for Safe Human-LLM Collaborative Writing
The paper introduces HarDBench, a benchmark for evaluating how vulnerable large language models (LLMs) are to draft-based co-authoring jailbreak attacks, in which malicious users submit incomplete drafts to elicit harmful outputs. It covers high-risk domains such as Explosives, Drugs, Weapons, and Cyberattacks, and uses realistically structured prompts to assess model susceptibility. The authors also propose a safety-utility balanced alignment approach that significantly reduces harmful outputs while maintaining co-authoring performance. The work highlights the need for robust evaluation frameworks in human-LLM collaborative writing.
llm, jailbreak, collaborative writing, benchmark