Research
Quantifying Subliminal Behavioral Transfer Ratios in Language Model Distillation
This study quantifies subliminal behavioral transfer ratios in language model distillation, using Llama-2-7B-Chat and Qwen2.5-7B-Instruct as teacher models. It finds that undesirable teacher characteristics can be transferred to student models even when the students are trained on benign data. Evaluations on 100 JailbreakBench prompts show significant transfer effects, with a different pattern for each teacher: Llama-2 exhibits sharp threshold behavior, while Qwen2.5 shows continuous transfer. Understanding these dynamics helps practitioners mitigate the risk of unwanted behaviors in distilled models.
distillation, llm, behavior