Safety
Cordyceps: Covert Control Attacks on LLMs via Data Poisoning
The paper introduces a novel data poisoning method for large language models (LLMs) that enables covert control attacks by leveraging semantic associations to encode malicious instructions. Evaluated across five LLMs, three backdoor defenses, and four prompt injection defenses, the method shows a 40% improvement in average attack success rate over traditional prompt injection attacks, achieving up to 93% success against backdoor defenses and 98% against prompt injection defenses. This research highlights a significant vulnerability in LLMs, emphasizing the need for enhanced defenses against sophisticated data poisoning techniques.
data poisoning, llm