Safety
Two Wrongs, No Right: Auditing Social-Desirability Bias in LLM Annotators for Computational Social Science
This study audits three open-source 7B instruction-tuned models (Zephyr, Mistral-Instruct, Qwen2.5-Instruct) as annotators for computational social science tasks and finds significant social-desirability biases that can distort research conclusions. Zephyr shows a leniency bias, with a false benign rate of 0.729 on offensive language, while Mistral and Qwen overcorrect; Mistral's false alarm rate for hate speech is 0.604. Conventional prompting interventions fail to correct these biases, underscoring the need for careful validation protocols in LLM annotation to ensure accurate empirical reporting.
llm, bias, annotation
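The two rates quoted in the abstract can be made concrete with a small sketch. The summary does not spell out their definitions, so the ones below are assumptions: the false benign rate is taken to be the share of gold-harmful items that the model labels benign (a leniency error), and the false alarm rate is taken to be the share of gold-benign items that the model labels harmful (an overcorrection error). The function names, label strings, and toy data are hypothetical and are not drawn from the study.

```python
from typing import Sequence


def false_benign_rate(gold: Sequence[str], pred: Sequence[str],
                      harmful: str, benign: str = "benign") -> float:
    """Assumed definition: fraction of gold-harmful items predicted benign."""
    preds_on_harmful = [p for g, p in zip(gold, pred) if g == harmful]
    if not preds_on_harmful:
        raise ValueError("no gold-harmful items to evaluate")
    return sum(p == benign for p in preds_on_harmful) / len(preds_on_harmful)


def false_alarm_rate(gold: Sequence[str], pred: Sequence[str],
                     harmful: str, benign: str = "benign") -> float:
    """Assumed definition: fraction of gold-benign items predicted harmful."""
    preds_on_benign = [p for g, p in zip(gold, pred) if g == benign]
    if not preds_on_benign:
        raise ValueError("no gold-benign items to evaluate")
    return sum(p == harmful for p in preds_on_benign) / len(preds_on_benign)


# Toy labels, invented purely for illustration (not data from the study).
gold = ["offensive", "offensive", "offensive", "offensive", "benign", "benign"]
pred = ["benign", "benign", "benign", "offensive", "benign", "offensive"]

fbr = false_benign_rate(gold, pred, harmful="offensive")
far = false_alarm_rate(gold, pred, harmful="offensive")
print(f"false benign rate: {fbr:.3f}")
print(f"false alarm rate:  {far:.3f}")
```

Reporting both rates per model, rather than a single accuracy figure, is what lets the two opposite failure modes (leniency versus overcorrection) show up separately instead of cancelling out in an aggregate score.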