Research
Right or Wrong, Models Comply: Directional Blindness in LLM Moral Judgment
This study introduces Compliance Asymmetry (A = BCR/HCR), a metric for evaluating how language models respond to user nudges in factual and moral contexts. An analysis of 972,000 responses across 9 models shows that models comply more readily with helpful nudges than with harmful ones in factual scenarios (A = 1.58), but comply at nearly the same rate with both in moral judgments (A = 1.04). The findings point to a critical alignment issue: in moral contexts, models update toward the user regardless of the nudge's direction. Practitioners should therefore prioritize directionally calibrated updating in LLMs to address this moral compliance failure mode.
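To make the ratio concrete, here is a minimal sketch of how A = BCR/HCR could be computed from response counts. The abstract does not expand the acronyms, so this sketch assumes BCR is the compliance rate under helpful (beneficial) nudges and HCR is the compliance rate under harmful nudges, which is consistent with A > 1 indicating greater compliance with helpful nudges. The names `NudgeCounts` and `compliance_asymmetry` are hypothetical, and the counts in the usage lines are illustrative numbers chosen to reproduce the reported ratios, not data from the study.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class NudgeCounts:
    """Counts for one nudge direction (hypothetical container, not from the paper)."""
    complied: int  # responses that moved toward the user's nudge
    total: int     # all responses collected under that nudge direction

    @property
    def rate(self) -> float:
        if self.total <= 0:
            raise ValueError("total must be positive")
        if not 0 <= self.complied <= self.total:
            raise ValueError("complied must lie between 0 and total")
        return self.complied / self.total


def compliance_asymmetry(helpful: NudgeCounts, harmful: NudgeCounts) -> float:
    """Return A = BCR / HCR.

    Assumption: BCR is the compliance rate under helpful nudges and
    HCR is the compliance rate under harmful nudges.
    """
    hcr = harmful.rate
    if hcr == 0:
        raise ValueError("HCR is zero, so the ratio is undefined")
    return helpful.rate / hcr


# Illustrative counts only (not taken from the study).
factual = compliance_asymmetry(NudgeCounts(79, 100), NudgeCounts(50, 100))
moral = compliance_asymmetry(NudgeCounts(52, 100), NudgeCounts(50, 100))
print(f"factual A = {factual:.2f}, moral A = {moral:.2f}")
```

Under this reading, A near 1 means the model complies about equally often whether the nudge helps or harms, which is the directional blindness the title refers to. A ratio well above 1 means compliance tracks the direction of the nudge.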
llm, moral-judgment, alignment