A Lightweight Tri-Stream Feature Fusion Network for Speech Emotion Recognition

Bibliographic Details
Main Authors: Ronghe Cao, Yunxing Wang, Xiaolong Wu, Shuang Jin, Huiling Niu
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/11075664/
Description
Summary: Understanding and modeling emotions from speech is a fundamental challenge in speech processing and a key enabler of emotionally intelligent human-computer interaction. However, defining and extracting robust emotional features remains difficult due to the nuanced and context-dependent nature of human affect. Existing approaches, whether focused on prosodic features or on deep representations from pre-trained models, often struggle to capture the full spectrum of emotional cues present in real-world speech. To address these limitations, we introduce Tri-Stream, a novel speech emotion recognition (SER) framework that concurrently leverages spectrogram and waveform modalities. Tri-Stream integrates three complementary feature streams: spectral patterns extracted via a Swin Transformer, deep acoustic representations from HuBERT, and engineered prosodic features capturing rhythmic information. These streams are fused and processed by a GRU-based classifier for final emotion prediction. Extensive evaluations on four benchmark datasets (IEMOCAP, SAVEE, RAVDESS, EMO-DB) demonstrate that Tri-Stream consistently outperforms state-of-the-art baselines, achieving 79.86% unweighted accuracy on IEMOCAP and the best performance on the remaining datasets, highlighting its effectiveness and robustness across diverse emotional speech corpora.
ISSN:2169-3536
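
To make the fusion stage described in the summary concrete, the following is a minimal, hypothetical PyTorch sketch: three pre-extracted, time-aligned feature streams (Swin Transformer spectrogram embeddings, HuBERT representations, and engineered prosodic features) are concatenated per frame and summarized by a bidirectional GRU before a linear emotion classifier. The feature dimensions, concatenation-based fusion, and four-class output are illustrative assumptions, not the authors' released implementation.

# Minimal, illustrative sketch of tri-stream fusion with a GRU classifier.
# Assumes pre-extracted frame-level features; dimensions (768 for Swin/HuBERT,
# 32 for prosody) and fusion by concatenation are assumptions for illustration.
import torch
import torch.nn as nn


class TriStreamFusionClassifier(nn.Module):
    def __init__(self, swin_dim=768, hubert_dim=768, prosody_dim=32,
                 hidden_dim=256, num_emotions=4):
        super().__init__()
        fused_dim = swin_dim + hubert_dim + prosody_dim
        # Bidirectional GRU aggregates the fused per-frame features over time.
        self.gru = nn.GRU(fused_dim, hidden_dim, batch_first=True,
                          bidirectional=True)
        self.classifier = nn.Linear(2 * hidden_dim, num_emotions)

    def forward(self, swin_feats, hubert_feats, prosody_feats):
        # Each input: (batch, time, dim); streams are assumed time-aligned.
        fused = torch.cat([swin_feats, hubert_feats, prosody_feats], dim=-1)
        _, h_n = self.gru(fused)                        # h_n: (2, batch, hidden)
        utterance = torch.cat([h_n[0], h_n[1]], dim=-1)  # concat both directions
        return self.classifier(utterance)                # emotion logits


if __name__ == "__main__":
    # Dummy batch: 2 utterances, 100 frames each, with random stand-in features.
    model = TriStreamFusionClassifier()
    logits = model(torch.randn(2, 100, 768),   # Swin Transformer spectrogram stream
                   torch.randn(2, 100, 768),   # HuBERT deep acoustic stream
                   torch.randn(2, 100, 32))    # engineered prosodic stream
    print(logits.shape)  # torch.Size([2, 4])

In practice, the three streams would come from a Swin Transformer applied to spectrograms, a pre-trained HuBERT encoder applied to raw waveforms, and a prosodic feature extractor, with alignment or pooling chosen to give the streams a common frame rate before fusion.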