Multimodal
Frozen Multimodal Embeddings for Personality and Cognitive Ability Assessment in Asynchronous Video Interviews
This paper presents a method for assessing personality traits and cognitive abilities from asynchronous video interviews (AVIs) using frozen multimodal encoders: CLIP for visual features, Whisper for audio, and RoBERTa, E5, and DeBERTaV3 for text. For personality trait prediction, the proposed system achieves a mean squared error (MSE) of 0.2696 against a baseline of 0.3334, a relative reduction of about 19%. Cognitive ability classification reaches an accuracy of 0.5313, about 12.5 percentage points above the baseline of 0.4062. These results suggest that trait-specific multimodal modeling is effective for psychological assessment, but they also point to the need to address potential shortcuts in cognitive ability prediction.
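To make the two reported metrics concrete, the following is a minimal sketch of how MSE, accuracy, and the gains over the baselines could be computed. The helper names (`mean_squared_error`, `accuracy`, `relative_reduction`) are illustrative assumptions and are not taken from the paper; only the four reported scores come from the abstract.

```python
def mean_squared_error(y_true, y_pred):
    """Average squared difference between targets and predictions."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def accuracy(labels, preds):
    """Fraction of predictions that match the labels exactly."""
    if len(labels) != len(preds) or not labels:
        raise ValueError("inputs must be non-empty and of equal length")
    return sum(1 for l, p in zip(labels, preds) if l == p) / len(labels)


def relative_reduction(baseline, model):
    """Relative reduction of an error metric with respect to a baseline."""
    return (baseline - model) / baseline


# Scores reported in the abstract.
BASELINE_MSE, MODEL_MSE = 0.3334, 0.2696
BASELINE_ACC, MODEL_ACC = 0.4062, 0.5313

mse_reduction = relative_reduction(BASELINE_MSE, MODEL_MSE)
acc_gain = MODEL_ACC - BASELINE_ACC

print(f"Relative MSE reduction: {mse_reduction:.1%}")
print(f"Accuracy gain: {acc_gain * 100:.1f} percentage points")
```

Lower is better for MSE, so the personality result is expressed as a relative reduction, while the classification result is expressed as an absolute difference in accuracy.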
personality assessment, video interviews, multimodal learning