A Lightweight Tri-Stream Feature Fusion Network for Speech Emotion Recognition

Bibliographic Details
Main Authors: Ronghe Cao, Yunxing Wang, Xiaolong Wu, Shuang Jin, Huiling Niu
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11075664/
Description
Summary: Understanding and modeling emotions from speech is a fundamental challenge in speech processing and a key enabler of emotionally intelligent human-computer interaction. However, defining and extracting robust emotional features remains difficult due to the nuanced and context-dependent nature of human affect. Existing approaches, whether focused on prosodic features or on deep representations from pre-trained models, often struggle to capture the full spectrum of emotional cues present in real-world speech. To address these limitations, we introduce Tri-Stream, a novel speech emotion recognition (SER) framework that concurrently leverages spectrogram and waveform modalities. Tri-Stream integrates three complementary feature streams: spectral patterns extracted via a Swin Transformer, deep acoustic representations from HuBERT, and engineered prosodic features capturing rhythmic information. These streams are fused and processed by a GRU-based classifier for final emotion prediction. Extensive evaluations on four benchmark datasets (IEMOCAP, SAVEE, RAVDESS, EMO-DB) demonstrate that Tri-Stream consistently outperforms state-of-the-art baselines, achieving 79.86% unweighted accuracy on IEMOCAP and the best performance on the remaining datasets, highlighting its effectiveness and robustness across diverse emotional speech corpora.
ISSN:2169-3536
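
To make the fusion stage described in the summary concrete, the following is a minimal, hypothetical PyTorch sketch: three pre-extracted, time-aligned feature streams (Swin Transformer spectrogram embeddings, HuBERT representations, and engineered prosodic features) are concatenated per frame and summarized by a bidirectional GRU before a linear emotion classifier. The feature dimensions, concatenation-based fusion, and four-class output are illustrative assumptions, not the authors' released implementation.

# Minimal, illustrative sketch of tri-stream fusion with a GRU classifier.
# Assumes pre-extracted frame-level features; dimensions (768 for Swin/HuBERT,
# 32 for prosody) and fusion by concatenation are assumptions for illustration.
import torch
import torch.nn as nn


class TriStreamFusionClassifier(nn.Module):
    def __init__(self, swin_dim=768, hubert_dim=768, prosody_dim=32,
                 hidden_dim=256, num_emotions=4):
        super().__init__()
        fused_dim = swin_dim + hubert_dim + prosody_dim
        # Bidirectional GRU aggregates the fused per-frame features over time.
        self.gru = nn.GRU(fused_dim, hidden_dim, batch_first=True,
                          bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, swin_feats, hubert_feats, prosody_feats):
        # Each input: (batch, time, dim); streams are assumed time-aligned.
        fused = torch.cat([swin_feats, hubert_feats, prosody_feats], dim=-1)
        _, h_n = self.gru(fused)                        # h_n: (2, batch, hidden)
        utterance = torch.cat([h_n[0], h_n[1]], dim=-1)  # concat both directions
        return self.classifier(utterance)                # emotion logits


if __name__ == "__main__":
    # Dummy batch: 2 utterances, 100 frames each, with random stand-in features.
    model = TriStreamFusionClassifier()
    logits = model(torch.randn(2, 100, 768),   # Swin Transformer spectrogram stream
                   torch.randn(2, 100, 768),   # HuBERT deep acoustic stream
                   torch.randn(2, 100, 32))    # engineered prosodic stream
    print(logits.shape)  # torch.Size([2, 4])

In practice, the three streams would come from a Swin Transformer applied to spectrograms, a pre-trained HuBERT encoder applied to raw waveforms, and a prosodic feature extractor, with alignment or pooling chosen to give the streams a common frame rate before fusion.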