Safety
Do Safety Monitors Stay Reliable After an Update? Benchmarking and Predicting Activation-Monitor Staleness
This study systematically evaluates whether activation monitors, the lightweight probes used in AI safety stacks, remain reliable after routine updates to language models. The findings indicate that quantization updates generally preserve monitor performance, whereas fine-tuning often leaves monitors stale, particularly privacy-related probes. The research highlights the need to revalidate activation monitors after fine-tuning and introduces a predictive approach for prioritizing which monitors require checks, thereby informing best practices for maintaining safety in deployed models.
monitoring, reliability, activation, updates