A Lightweight Forward–Backward Independent Temporal-Aware Causal Network for Speech Emotion Recognition
Speech Emotion Recognition (SER) technology analyzes speech characteristics in human-computer interactions to understand user intent and improve interaction experience. It is widely used in the field of intelligent interaction. The significant challenge is to recognize the speech emotion of the spea...
| Main Authors: | Sijia Fei, Qiang Feng, Fei Gao |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
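| Subjects: | Speech emotion recognition; temporal modeling; multi-level feature fusion; dilated causal convolution |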
| Online Access: | https://ieeexplore.ieee.org/document/10993381/ |
| _version_ | 1850136840911192064 |
|---|---|
| author | Sijia Fei; Qiang Feng; Fei Gao |
| collection | DOAJ |
| description | Speech Emotion Recognition (SER) analyzes speech characteristics in human-computer interaction to understand user intent and improve the interaction experience, and it is widely used in intelligent interaction. A significant challenge is recognizing the speaker's emotion both faster and more precisely. To address this, we propose a lightweight forward-backward independent temporal-aware causal network, termed I-TCN, that constructs efficient bi-directional representations of causality in time-domain speech sequences. Specifically, a forward temporal-aware component built from dilated causal convolutions and skip connections models forward sequence features, capturing forward semantic information and causal relationships in the speech signal, which facilitates the prediction of future emotional changes. A backward temporal-aware module uses dilated causal convolutions to learn backward information and applies weighted fusion of multi-level temporal features to enhance the perception of backward emotion changes. Finally, forward and backward features from different levels are fused to refine historical-future emotion change trends and better capture the details of emotion changes. Experimental results on six speech datasets covering several languages (EMODB: 95.52%; EMOVO: 92.00%; RAVDESS: 93.75%; SAVEE: 88.54%; CASIA: 94.50%; IEMOCAP: 71.47%) show that the proposed method is highly competitive with state-of-the-art methods. The method also has good application prospects, with a small number of parameters (0.21M) and a low computational cost (80.72 MFLOPs). |
| format | Article |
| id | doaj-art-d19bdb624e3e45a586f462fdcf9954b3 |
| institution | OA Journals |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | Record doaj-art-d19bdb624e3e45a586f462fdcf9954b3 (2025-08-20T02:31:00Z). IEEE Access (IEEE), ISSN 2169-3536, 2025-01-01, vol. 13, pp. 82914-82926. DOI: 10.1109/ACCESS.2025.3567954. Article number: 10993381. Authors: [0] Sijia Fei, School of Media and Foreign Languages, Xi’an Jiaotong University City College, Xi’an, China; [1] Qiang Feng (https://orcid.org/0000-0002-6842-8404), School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, China; [2] Fei Gao (https://orcid.org/0000-0003-4398-7194), Zhan Tianyou College (CRRC College), Dalian Jiaotong University, Dalian, China. |
| title | A Lightweight Forward–Backward Independent Temporal-Aware Causal Network for Speech Emotion Recognition |
| topic | Speech emotion recognition; temporal modeling; multi-level feature fusion; dilated causal convolution |
| url | https://ieeexplore.ieee.org/document/10993381/ |
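
The description in this record mentions dilated causal convolutions, skip connections, and weighted fusion of multi-level forward and backward features. The following is a minimal illustrative sketch of those general ideas in plain Python, not the authors' I-TCN: the function names, the fixed (unlearned) kernel, the additive skip connection, and the simple weighted sum used for fusion are all assumptions made for illustration.

```python
from typing import List, Sequence


def dilated_causal_conv(x: Sequence[float], kernel: Sequence[float],
                        dilation: int) -> List[float]:
    """y[t] = sum_k kernel[k] * x[t - k * dilation]; inputs before t = 0 count as zero."""
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation
            if idx >= 0:  # causal: only current and past samples contribute
                acc += w * x[idx]
        out.append(acc)
    return out


def causal_stack(x: Sequence[float], kernel: Sequence[float],
                 dilations: Sequence[int]) -> List[List[float]]:
    """Stack dilated causal layers with an additive skip connection.

    Returns the output of every level so that levels can be fused later.
    """
    levels = []
    h = list(x)
    for d in dilations:
        conv = dilated_causal_conv(h, kernel, d)
        h = [a + b for a, b in zip(h, conv)]  # skip connection (assumed additive)
        levels.append(h)
    return levels


def forward_backward_features(x: Sequence[float], kernel: Sequence[float],
                              dilations: Sequence[int],
                              level_weights: Sequence[float]) -> List[float]:
    """Run independent forward and backward causal stacks and fuse all levels."""
    fwd_levels = causal_stack(x, kernel, dilations)
    # Backward branch: process the time-reversed sequence, then flip back.
    bwd_levels = [lvl[::-1] for lvl in causal_stack(list(x)[::-1], kernel, dilations)]
    fused = []
    for t in range(len(x)):
        fused.append(sum(w * (f[t] + b[t])
                         for w, f, b in zip(level_weights, fwd_levels, bwd_levels)))
    return fused


if __name__ == "__main__":
    signal = [0.0, 0.0, 1.0, 0.0, 0.0]  # toy impulse, not real speech features
    print(forward_backward_features(signal, [0.5, 0.5], [1, 2], [0.5, 0.5]))
```

In this sketch the forward branch at time t depends only on samples at or before t, while the backward branch depends only on samples at or after t, so the two directions stay independent until the final fusion step.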