Adapting Reinforcement Learning with Chain-of-Thought Supervision for Explainable Detection of Hateful and Propagandistic Memes
The paper introduces a reinforcement learning-based post-training method that improves both the classification performance and the explanation quality of thinking-based multimodal large language models (MLLMs) for detecting hateful and propagandistic memes. Its key contributions are the use of Group Relative Policy Optimization (GRPO) with thinking-length regularization, an empirical study across English and Arabic benchmarks, and the extension of existing meme datasets with weakly supervised chain-of-thought rationales. Reported results include accuracy gains of up to +2.1% on the Hateful Memes benchmark and macro-F1 gains of up to +7.6 points on ArMeme. Because macro-F1 weights every class equally, the latter gain suggests more balanced performance across classes, and the generated natural-language explanations make the method relevant to practitioners who need explainable model decisions.
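To make the GRPO idea concrete, the following is a minimal sketch of how a group-relative advantage could be combined with a thinking-length penalty. It is not the paper's implementation: the summary does not state the exact reward or regularizer, so the function `length_regularized_reward`, its `target_tokens` and `penalty` parameters, and the sample values are all hypothetical. Only the group-wise standardization of rewards reflects the general GRPO recipe.

```python
from statistics import mean, pstdev


def length_regularized_reward(correct, think_tokens, target_tokens=128, penalty=0.001):
    """Hypothetical reward: 1.0 for a correct label, 0.0 otherwise,
    minus a linear penalty for thinking tokens beyond an assumed target length."""
    base = 1.0 if correct else 0.0
    excess = max(0, think_tokens - target_tokens)
    return base - penalty * excess


def group_relative_advantages(rewards, eps=1e-8):
    """GRPO-style advantage: standardize each reward against the mean and
    standard deviation of its own group of sampled responses."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of responses sampled for the same meme:
# (label was correct?, number of thinking tokens). Values are illustrative.
group = [(True, 100), (True, 400), (False, 120), (False, 500)]
rewards = [length_regularized_reward(c, n) for c, n in group]
advantages = group_relative_advantages(rewards)

for (correct, n_tokens), adv in zip(group, advantages):
    print(f"correct={correct!s:5} tokens={n_tokens:3d} advantage={adv:+.3f}")
```

Under these assumptions, a correct and concise response receives the largest advantage, a correct but verbose one a smaller advantage, and incorrect responses fall below the group mean. No separate value network is needed because the group itself serves as the baseline.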