A Lightweight Forward–Backward Independent Temporal-Aware Causal Network for Speech Emotion Recognition

Speech Emotion Recognition (SER) technology analyzes speech characteristics in human-computer interactions to understand user intent and improve interaction experience. It is widely used in the field of intelligent interaction. The significant challenge is to recognize the speech emotion of the spea...

Bibliographic Details
Main Authors: Sijia Fei, Qiang Feng, Fei Gao
Format: Article
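Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Speech emotion recognition; temporal modeling; multi-level feature fusion; dilated causal convolution
Online Access: https://ieeexplore.ieee.org/document/10993381/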
author Sijia Fei
Qiang Feng
Fei Gao
collection DOAJ
description Speech Emotion Recognition (SER) analyzes speech characteristics in human-computer interaction to understand user intent and improve the interaction experience, and it is widely used in intelligent interaction. A significant challenge is to recognize the speaker's emotion faster and more precisely. To address this issue, we propose a lightweight forward-backward independent temporal-aware causal network, termed I-TCN, that constructs efficient bi-directional representations of causality in time-domain speech sequences. Specifically, a forward temporal-aware component built from dilated causal convolutions and skip connections models forward sequence features, capturing forward semantic information and causal relationships in the speech signal, which facilitates the prediction of future emotional changes. Furthermore, the backward temporal-aware module uses dilated causal convolutions to learn backward information and a weighted fusion of multi-level temporal features to enhance the perception of backward emotion changes. Finally, different levels of forward-backward features are fused to refine historical-future emotion change trends and better perceive the details of emotion changes. Experimental results on six datasets spanning several languages (EMODB: 95.52%; EMOVO: 92.00%; RAVDESS: 93.75%; SAVEE: 88.54%; CASIA: 94.50%; IEMOCAP: 71.47%) show that the emotion recognition capability of the proposed method is highly competitive with state-of-the-art methods. The numerical results also show that the proposed method has good application prospects, with a small number of parameters (0.21M) and low computational cost (80.72 MFLOPs).
format Article
id doaj-art-d19bdb624e3e45a586f462fdcf9954b3
institution OA Journals
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-d19bdb624e3e45a586f462fdcf9954b3
2025-08-20T02:31:00Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
13
82914
82926
10.1109/ACCESS.2025.3567954
10993381
A Lightweight Forward–Backward Independent Temporal-Aware Causal Network for Speech Emotion Recognition
Sijia Fei [0]
Qiang Feng [1] https://orcid.org/0000-0002-6842-8404
Fei Gao [2] https://orcid.org/0000-0003-4398-7194
School of Media and Foreign Languages, Xi’an Jiaotong University City College, Xi’an, China
School of Automation Engineering, University of Electronic Science and Technology of China, Chengdu, China
Zhan Tianyou College (CRRC College), Dalian Jiaotong University, Dalian, China
https://ieeexplore.ieee.org/document/10993381/
Speech emotion recognition
temporal modeling
multi-level feature fusion
dilated causal convolution
title A Lightweight Forward–Backward Independent Temporal-Aware Causal Network for Speech Emotion Recognition
topic Speech emotion recognition
temporal modeling
multi-level feature fusion
dilated causal convolution
url https://ieeexplore.ieee.org/document/10993381/
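
The description names dilated causal convolutions as the building block of both the forward and the backward branch. The sketch below illustrates that building block only; it is not the authors' I-TCN. It has no learned weights, no skip connections, and no weighted fusion, and the function names (`dilated_causal_conv1d`, `receptive_field`, `forward_backward_features`) are hypothetical. Running the backward branch on the time-reversed sequence is an assumption made for illustration; the record does not say how the backward module is implemented.

```python
from typing import List, Sequence, Tuple


def dilated_causal_conv1d(
    x: Sequence[float], kernel: Sequence[float], dilation: int = 1
) -> List[float]:
    """Causal 1-D convolution: y[t] uses only x[t], x[t-d], x[t-2d], ...

    Positions before the start of the sequence count as zeros (left padding).
    """
    if dilation < 1:
        raise ValueError("dilation must be >= 1")
    out = []
    for t in range(len(x)):
        acc = 0.0
        for k, w in enumerate(kernel):
            idx = t - k * dilation  # look back k*dilation steps, never ahead
            if idx >= 0:
                acc += w * x[idx]
        out.append(acc)
    return out


def receptive_field(kernel_size: int, dilations: Sequence[int]) -> int:
    """Input steps visible to one output after stacking the given layers."""
    return 1 + (kernel_size - 1) * sum(dilations)


def forward_backward_features(
    x: Sequence[float], kernel: Sequence[float], dilations: Sequence[int]
) -> Tuple[List[float], List[float]]:
    """Apply the same causal stack to the sequence and to its time reversal.

    Assumption for illustration: the backward branch is the causal stack
    run on the reversed input, so its output at step t sees t and later steps.
    """
    fwd = list(x)
    bwd = list(reversed(x))
    for d in dilations:
        fwd = dilated_causal_conv1d(fwd, kernel, d)
        bwd = dilated_causal_conv1d(bwd, kernel, d)
    # Reverse again so index t means the same time step in both outputs.
    return fwd, list(reversed(bwd))


if __name__ == "__main__":
    impulse = [1.0] + [0.0] * 7
    fwd, bwd = forward_backward_features(impulse, [1.0, 1.0], [1, 2, 4])
    print("forward :", fwd)
    print("backward:", bwd)
    print("receptive field:", receptive_field(2, [1, 2, 4]))
```

With an impulse at the first step, a kernel of size 2, and dilations 1, 2, 4, the forward output is nonzero at all eight steps (the receptive field is 8), while the backward output is nonzero only at the first step, because a backward output at step t can only see step t and what follows it.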