Research
Towards Robust Arabic Speech Emotion Recognition with Deep Learning
This study compares three architectures for Arabic Speech Emotion Recognition (SER): a CNN-LSTM model, a CNN-Transformer model, and a fine-tuned wav2vec 2.0 model. Evaluated on the EYASE and BAVED datasets, the CNN-Transformer achieves the highest accuracy, 98.1%. The results highlight the effectiveness of hybrid architectures, which capture both local spectral cues and long-range temporal dependencies, in addressing the dialectal diversity and limited annotated data that challenge Arabic SER. For practitioners, the findings offer guidance on choosing model architectures to improve SER performance in low-resource languages.
speech, emotion recognition, deep learning
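
To make the "local cues plus long-range dependencies" idea concrete, here is a dependency-free toy sketch, not the study's model. It assumes a clip is a list of frames, each a short feature vector, and chains a small convolution over time (local context) with single-head self-attention using identity projections (global context). All names (`conv1d_same`, `self_attention`, `hybrid`), the smoothing kernel, and the made-up clip are illustrative assumptions; a real CNN-Transformer would use learned weights, multiple heads, and spectrogram inputs.

```python
import math


def conv1d_same(frames, kernel):
    """Slide `kernel` over the time axis (zero-padded), independently per feature."""
    half = len(kernel) // 2
    n_frames, n_feats = len(frames), len(frames[0])
    out = []
    for t in range(n_frames):
        row = []
        for d in range(n_feats):
            acc = 0.0
            for k, w in enumerate(kernel):
                src = t + k - half
                if 0 <= src < n_frames:  # zero padding outside the clip
                    acc += w * frames[src][d]
            row.append(acc)
        out.append(row)
    return out


def self_attention(frames):
    """Single-head scaled dot-product attention with identity Q/K/V (a simplification)."""
    scale = math.sqrt(len(frames[0]))
    out = []
    for query in frames:
        scores = [sum(q * k for q, k in zip(query, key)) / scale for key in frames]
        peak = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - peak) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        # Each output frame is a weighted mix of ALL input frames.
        out.append([
            sum(w * value[d] for w, value in zip(weights, frames))
            for d in range(len(query))
        ])
    return out


def hybrid(frames, kernel=(0.25, 0.5, 0.25)):
    """Convolution first (local context), then attention (global context)."""
    return self_attention(conv1d_same(frames, kernel))


# Made-up clip: 8 frames x 2 features (hypothetical data, for illustration only).
clip = [[math.sin(0.5 * t), math.cos(0.3 * t)] for t in range(8)]

# Change only the LAST frame and see which stage lets frame 0 "notice".
perturbed = [row[:] for row in clip]
perturbed[-1] = [3.0, -3.0]

smooth = (0.25, 0.5, 0.25)
local_same = conv1d_same(clip, smooth)[0] == conv1d_same(perturbed, smooth)[0]
global_changed = hybrid(clip)[0] != hybrid(perturbed)[0]
```

In this toy setup both `local_same` and `global_changed` evaluate to `True`: the convolution output at frame 0 depends only on its immediate neighbours, while the attention stage lets a change at the far end of the clip influence frame 0.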