Knowledge enhancement for speech emotion recognition via multi-level acoustic feature

Bibliographic Details
Main Authors: Huan Zhao, Nianxin Huang, Haijiao Chen
Format: Article
Language: English
Published: Taylor & Francis Group, 2024-12-01
Series: Connection Science
Subjects: Cross-fusion; multi-level feature; multi-task learning; speech emotion recognition
Online Access: https://www.tandfonline.com/doi/10.1080/09540091.2024.2312103
_version_ 1850214446002077696
author Huan Zhao
Nianxin Huang
Haijiao Chen
author_facet Huan Zhao
Nianxin Huang
Haijiao Chen
author_sort Huan Zhao
collection DOAJ
description Speech emotion recognition (SER) has become an increasingly attractive machine learning task for domain applications. It aims to improve the discriminative capacity of speech emotion models utilising a single type of feature (e.g. MFCC, spectrograms, Wav2vec2) or a combination of multiple feature types. However, the potential of acoustic-related deep features is frequently overlooked in existing approaches that rely solely on a single type of feature or employ a basic combination of multiple feature types. To address this challenge, a multi-level acoustic feature cross-fusion approach is proposed, aiming to compensate for the information missing between the various features. It enhances SER performance by integrating different types of knowledge through the cross-fusion mechanism. Moreover, multi-task learning is utilised to share useful information through gender recognition, which also yields multiple common representations in a fine-grained space. Experimental results show that the fusion approach captures the inner connections between multi-level acoustic features to refine the knowledge. State-of-the-art (SOTA) results were obtained under the same experimental conditions.
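The abstract above combines two ideas: cross-fusion between different levels of acoustic features, and an auxiliary gender-recognition task sharing one representation. The NumPy sketch below illustrates one plausible reading of that pipeline — bidirectional cross-attention between two feature streams, pooled into an utterance vector that feeds two task heads. All shapes, weight matrices, and the two-stream setup (MFCC-like and Wav2vec2-like features) are illustrative assumptions, not the paper's actual architecture.

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def cross_fusion(a, b, W_q, W_k, W_v):
    """One direction of cross-attention: stream `a` queries stream `b`,
    so information missing from `a` can be filled in from `b`."""
    q, k, v = a @ W_q, b @ W_k, b @ W_v
    attn = softmax(q @ k.T / np.sqrt(k.shape[-1]))  # (T_a, T_b) alignment
    return attn @ v                                 # b's content aligned to a's frames

# Toy stand-ins for two feature levels, already projected to a common dim.
T, d = 50, 64                        # 50 frames, 64-dim projections (assumed)
mfcc = rng.normal(size=(T, d))       # low-level hand-crafted features
w2v = rng.normal(size=(T, d))        # deep pretrained features
W_q, W_k, W_v = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

# Bidirectional cross-fusion, then mean-pool over time to one utterance vector.
fused = np.concatenate(
    [cross_fusion(mfcc, w2v, W_q, W_k, W_v),
     cross_fusion(w2v, mfcc, W_q, W_k, W_v)], axis=-1).mean(axis=0)  # (2d,)

# Multi-task heads share the fused representation (random weights here).
emotion_logits = fused @ rng.normal(size=(2 * d, 4)) * 0.1  # e.g. 4 emotion classes
gender_logits = fused @ rng.normal(size=(2 * d, 2)) * 0.1   # auxiliary gender task
print(fused.shape, emotion_logits.shape, gender_logits.shape)
```

In a trained model the two heads would be optimised jointly, so gradients from the gender task regularise the shared fused representation — the knowledge-sharing effect the abstract describes.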
format Article
id doaj-art-856a9be7d7d74df7942ade9556c59a5b
institution OA Journals
issn 0954-0091
1360-0494
language English
publishDate 2024-12-01
publisher Taylor & Francis Group
record_format Article
series Connection Science
spelling doaj-art-856a9be7d7d74df7942ade9556c59a5b 2025-08-20T02:08:54Z
eng | Taylor & Francis Group | Connection Science | ISSN 0954-0091, 1360-0494
2024-12-01 | vol. 36, no. 1 | doi:10.1080/09540091.2024.2312103
Knowledge enhancement for speech emotion recognition via multi-level acoustic feature
Huan Zhao, Nianxin Huang, Haijiao Chen (all: College of Computer Science and Electronic Engineering, Hunan University, Changsha, Hunan, People's Republic of China)
Speech emotion recognition (SER) has become an increasingly attractive machine learning task for domain applications. It aims to improve the discriminative capacity of speech emotion models utilising a single type of feature (e.g. MFCC, spectrograms, Wav2vec2) or a combination of multiple feature types. However, the potential of acoustic-related deep features is frequently overlooked in existing approaches that rely solely on a single type of feature or employ a basic combination of multiple feature types. To address this challenge, a multi-level acoustic feature cross-fusion approach is proposed, aiming to compensate for the information missing between the various features. It enhances SER performance by integrating different types of knowledge through the cross-fusion mechanism. Moreover, multi-task learning is utilised to share useful information through gender recognition, which also yields multiple common representations in a fine-grained space. Experimental results show that the fusion approach captures the inner connections between multi-level acoustic features to refine the knowledge. State-of-the-art (SOTA) results were obtained under the same experimental conditions.
https://www.tandfonline.com/doi/10.1080/09540091.2024.2312103
Cross-fusion | multi-level feature | multi-task learning | speech emotion recognition
spellingShingle Huan Zhao
Nianxin Huang
Haijiao Chen
Knowledge enhancement for speech emotion recognition via multi-level acoustic feature
Connection Science
Cross-fusion
multi-level feature
multi-task learning
speech emotion recognition
title Knowledge enhancement for speech emotion recognition via multi-level acoustic feature
title_full Knowledge enhancement for speech emotion recognition via multi-level acoustic feature
title_fullStr Knowledge enhancement for speech emotion recognition via multi-level acoustic feature
title_full_unstemmed Knowledge enhancement for speech emotion recognition via multi-level acoustic feature
title_short Knowledge enhancement for speech emotion recognition via multi-level acoustic feature
title_sort knowledge enhancement for speech emotion recognition via multi level acoustic feature
topic Cross-fusion
multi-level feature
multi-task learning
speech emotion recognition
url https://www.tandfonline.com/doi/10.1080/09540091.2024.2312103
work_keys_str_mv AT huanzhao knowledgeenhancementforspeechemotionrecognitionviamultilevelacousticfeature
AT nianxinhuang knowledgeenhancementforspeechemotionrecognitionviamultilevelacousticfeature
AT haijiaochen knowledgeenhancementforspeechemotionrecognitionviamultilevelacousticfeature