Safety
The Distributed Detectability Band Against Marginal-Preserving Attacks
The paper presents a novel marginal-preserving, correlation-encoded distributed-sabotage attack against AI-control monitors, utilizing a Gaussian-copula AR(1) construction. It demonstrates that traditional per-step monitors (Monitor A) can be defeated while maintaining benign scores, achieving a KS-distance of 0.013, and shows that alternative monitors (Monitor B) can maintain high AUC values (0.79-0.97) in detecting attacks, highlighting the limitations of distribution-shape monitors. This work is significant for AI practitioners as it underscores the need for robust temporal-correlation monitoring to effectively detect subtle sabotage, which can evade conventional monitoring techniques.
AI safetymonitoringattack