Multimodal emotion recognition based on multilevel acoustic and textual information
16 October 2023
Ya Zhou, Yuyi Xing, Guimin Huang, Qingkai Guo, Nanxiao Deng
Proceedings Volume 12803, Fifth International Conference on Artificial Intelligence and Computer Science (AICS 2023); 1280327 (2023) https://doi.org/10.1117/12.3009468
Event: 2023 5th International Conference on Artificial Intelligence and Computer Science (AICS 2023), 2023, Wuhan, China
Abstract
Multimodal emotion recognition has attracted considerable attention in recent years and remains one of the challenging tasks in affective computing. We propose a multimodal speech emotion recognition model that uses multilevel acoustic and textual information. The model uses transcribed text to complement the speech data and enable accurate emotion recognition. In the unimodal models, we employ AlexNet, BiGRU, and HuBERT to extract multilevel acoustic features from speech, and a RoBERTa encoder to extract text features. We then fuse speech and text with a co-attention mechanism, which extracts complementary information across modalities and eliminates inter-modality noise, thereby enhancing the emotional representation of the target modality. Finally, the fused features are used to predict the emotion category. Our model achieves a weighted accuracy of 77.41% and an unweighted accuracy of 78.66% in emotion recognition.
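The abstract reports two accuracy figures without defining them. The sketch below assumes the convention common in speech emotion recognition work: weighted accuracy is the overall fraction of correctly classified utterances, and unweighted accuracy is the mean of the per-class recalls, so that every emotion class counts equally regardless of its size. The function names and the toy labels are hypothetical and are not taken from the paper.

```python
from collections import defaultdict


def weighted_accuracy(y_true, y_pred):
    """Overall fraction of correct predictions (assumed meaning of 'weighted')."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return correct / len(y_true)


def unweighted_accuracy(y_true, y_pred):
    """Mean of per-class recalls (assumed meaning of 'unweighted')."""
    if len(y_true) != len(y_pred) or not y_true:
        raise ValueError("label lists must be non-empty and of equal length")
    totals = defaultdict(int)  # utterances per true class
    hits = defaultdict(int)    # correctly predicted utterances per true class
    for t, p in zip(y_true, y_pred):
        totals[t] += 1
        if t == p:
            hits[t] += 1
    recalls = [hits[c] / totals[c] for c in totals]
    return sum(recalls) / len(recalls)


# Hypothetical toy labels, for illustration only (not data from the paper).
y_true = ["angry", "angry", "angry", "angry", "happy", "happy", "sad", "neutral"]
y_pred = ["angry", "angry", "angry", "happy", "happy", "sad", "sad", "neutral"]

print(weighted_accuracy(y_true, y_pred))    # 6 of 8 correct
print(unweighted_accuracy(y_true, y_pred))  # mean of 0.75, 0.5, 1.0, 1.0
```

Under this assumed convention the two figures differ whenever the classes are imbalanced, which is why both are usually reported together.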
(2023) Published by SPIE. Downloading of the abstract is permitted for personal use only.
Ya Zhou, Yuyi Xing, Guimin Huang, Qingkai Guo, and Nanxiao Deng "Multimodal emotion recognition based on multilevel acoustic and textual information", Proc. SPIE 12803, Fifth International Conference on Artificial Intelligence and Computer Science (AICS 2023), 1280327 (16 October 2023); https://doi.org/10.1117/12.3009468
KEYWORDS
Emotion
Speech recognition
Acoustics
Feature extraction
Feature fusion
Machine learning
Data modeling