A Lightweight Forward–Backward Independent Temporal-Aware Causal Network for Speech Emotion Recognition

Bibliographic Details
Main Authors: Sijia Fei, Qiang Feng, Fei Gao
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10993381/
Description
Summary:Speech Emotion Recognition (SER) technology analyzes speech characteristics in human-computer interaction to understand user intent and improve the interaction experience, and it is widely used in intelligent interaction. A significant challenge is recognizing the speaker's speech emotion faster and more precisely. To address this issue, we propose a lightweight forward-backward independent temporal-aware causal network, termed I-TCN, to construct efficient bi-directional representations of causality in speech time-domain sequences. Specifically, a forward temporal-aware component built from dilated causal convolutions and skip connections performs forward sequence feature modeling, capturing forward semantic information and causal relationships in the speech signal and facilitating the prediction of future emotional changes. The backward temporal-aware module likewise uses dilated causal convolutions to learn backward information and applies weighted fusion of multi-level temporal features to enhance the perception of backward emotion changes. Finally, forward and backward features from different levels are fused to refine historical-future emotion trends and better perceive the details of emotion changes. Experimental results on six datasets in different languages (EMODB: 95.52%; EMOVO: 92.00%; RAVDESS: 93.75%; SAVEE: 88.54%; CASIA: 94.50%; IEMOCAP: 71.47%) show that the emotion recognition capability of the proposed method is highly competitive with state-of-the-art techniques. Meanwhile, the numerical results show that the proposed method has good application prospects, with a small number of parameters (0.21M) and low computational cost (80.72 MFLOPs).
ISSN:2169-3536
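
To make the abstract's architecture concrete, the following is a minimal PyTorch sketch of the bi-directional dilated-causal-convolution idea it describes: a forward branch of dilated causal convolutions with skip connections, an independent backward branch that processes the time-reversed sequence, and a learnable weighted fusion of the multi-level features from both branches. All names (CausalBlock, ITCNSketch), layer sizes, the softmax-weighted fusion, and the mean-pooled classification head are illustrative assumptions, not the authors' exact I-TCN implementation.

    # Sketch only: the specific hyperparameters and fusion scheme are assumed,
    # not taken from the paper.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class CausalBlock(nn.Module):
        """One dilated causal conv layer with a residual (skip) connection."""

        def __init__(self, channels: int, kernel_size: int, dilation: int):
            super().__init__()
            # Left-pad so each output frame sees only current and past frames.
            self.pad = (kernel_size - 1) * dilation
            self.conv = nn.Conv1d(channels, channels, kernel_size, dilation=dilation)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            y = self.conv(F.pad(x, (self.pad, 0)))  # causal: pad the past side only
            return F.relu(y) + x  # skip connection

    class ITCNSketch(nn.Module):
        """Independent forward/backward temporal branches with weighted fusion."""

        def __init__(self, channels: int = 64, n_layers: int = 4, n_classes: int = 7):
            super().__init__()
            dilations = [2 ** i for i in range(n_layers)]  # 1, 2, 4, 8
            self.fwd = nn.ModuleList(CausalBlock(channels, 3, d) for d in dilations)
            self.bwd = nn.ModuleList(CausalBlock(channels, 3, d) for d in dilations)
            # Learnable weights fusing multi-level features from both branches.
            self.level_w = nn.Parameter(torch.zeros(2 * n_layers))
            self.head = nn.Linear(channels, n_classes)

        def forward(self, x: torch.Tensor) -> torch.Tensor:
            # x: (batch, channels, time) acoustic feature sequence
            levels = []
            h = x
            for blk in self.fwd:  # forward (past -> future) causal modeling
                h = blk(h)
                levels.append(h)
            h = torch.flip(x, dims=[-1])  # reverse time for the backward branch
            for blk in self.bwd:
                h = blk(h)
                levels.append(torch.flip(h, dims=[-1]))  # restore time order
            w = torch.softmax(self.level_w, dim=0)
            fused = sum(wi * li for wi, li in zip(w, levels))
            return self.head(fused.mean(dim=-1))  # pool over time, classify

    # Example: 7-way emotion logits for a batch of two feature sequences.
    logits = ITCNSketch()(torch.randn(2, 64, 300))
    print(logits.shape)  # torch.Size([2, 7])

Keeping the two branches as separate module lists (rather than one bidirectional layer) mirrors the "independent" forward-backward design the abstract emphasizes; the softmax over level weights is one simple way to realize "weighted fusion of multi-level temporal features".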