Safety · arXiv cs.AI · 7 d ago

Prefill Awareness in Large Language Models

The study investigates "prefill awareness" in large language models: their ability to detect, and respond to, outputs that have been tampered with. It finds that models such as Claude Opus 4.5 identify prefills that contradict their preferences in 9-35% of cases, with no false positives. This makes prefill awareness a significant confound for safety-relevant evaluations. Practitioners designing alignment and control protocols should therefore account for it, since it may reduce the effectiveness of these methods.

prefill awareness · llm · safety evaluation
