Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features
The detection and classification of emotional states in speech involve the analysis of audio signals and text transcriptions. There are complex relationships between the extracted features at different time intervals which ought to be analyzed to infer the emotions in speech. These relationships...
Main Authors: Kakuba, Samuel; Poulose, Alwin; Han, Dong Seog (Senior Member, IEEE)
Format: Article
Language: English
Published: IEEE, 2023
Subjects: Emotion recognition; Spatial features; Temporal features; Semantic tendency features; Multi-head attention
Online Access: http://hdl.handle.net/20.500.12493/921
_version_ | 1800403075407544320 |
author | Kakuba, Samuel; Poulose, Alwin; Han, Dong Seog (Senior Member, IEEE) |
author_facet | Kakuba, Samuel; Poulose, Alwin; Han, Dong Seog (Senior Member, IEEE) |
author_sort | Kakuba, Samuel |
collection | KAB-DR |
description | The detection and classification of emotional states in speech involve the analysis of audio signals and text transcriptions. Complex relationships exist between the features extracted at different time intervals, and these must be analyzed to infer the emotions in speech. These relationships can be represented as spatial, temporal, and semantic tendency features. In addition to the emotional features present in each modality, the text modality carries semantic and grammatical tendencies in the uttered sentences. Deep learning-based models have typically extracted spatial and temporal features sequentially, using convolutional neural networks (CNN) followed by recurrent neural networks (RNN); this sequential pipeline may be weak not only at detecting the separate spatial-temporal feature representations but also at capturing the semantic tendencies in speech. In this paper, we propose a deep learning-based model, the concurrent spatial-temporal and grammatical (CoSTGA) model, which concurrently learns spatial, temporal, and semantic representations in a local feature learning block (LFLB); these are fused into a latent vector that forms the input to a global feature learning block (GFLB). We also compare the performance of multi-level feature fusion with single-level fusion using the multi-level transformer encoder model (MLTED), which we likewise propose in this paper. The CoSTGA model applies multi-level fusion first at the LFLB level, where similar features (spatial or temporal) are separately extracted from each modality, and second at the GFLB level, where the spatial-temporal features are fused with the semantic tendency features. The model combines dilated causal convolutions (DCC), bidirectional long short-term memory (BiLSTM), transformer encoders (TE), and multi-head and self-attention mechanisms. Acoustic and lexical features were extracted from the interactive emotional dyadic motion capture (IEMOCAP) dataset. The proposed model achieves a weighted accuracy of 75.50%, an unweighted accuracy of 75.82%, a recall of 75.32%, and an F1 score of 75.57%. These results imply that spatial-temporal features learned concurrently with semantic tendencies in a multi-level approach improve the model's effectiveness and robustness. |
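The description above lists dilated causal convolutions (DCC) among the model's building blocks. As background only (this is not the paper's code; the filter weights and dilation values are illustrative), a dilated causal convolution computes each output from the current and past inputs spaced `dilation` steps apart, so stacking layers widens the receptive field over the speech signal without extra parameters:

```python
def dilated_causal_conv(x, w, dilation=1):
    """1-D dilated causal convolution: y[t] = sum_k w[k] * x[t - k*dilation].

    Causal: y[t] depends only on x[t] and earlier samples; references to
    samples before the start of the sequence act as zero-padding.
    """
    y = []
    for t in range(len(x)):
        acc = 0.0
        for k, wk in enumerate(w):
            idx = t - k * dilation
            if idx >= 0:
                acc += wk * x[idx]
        y.append(acc)
    return y

# With dilation=2 the filter skips every other sample:
print(dilated_causal_conv([1, 2, 3, 4], [1, 1], dilation=2))  # [1.0, 2.0, 4.0, 6.0]
```

In practice such layers run over framed acoustic feature sequences with learned weights; the pure-Python loop above only sketches the indexing that makes the convolution both dilated and causal.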
format | Article |
id | oai:idr.kab.ac.ug:20.500.12493-921 |
institution | KAB-DR |
language | English |
publishDate | 2023 |
publisher | IEEE |
record_format | dspace |
spelling | oai:idr.kab.ac.ug:20.500.12493-921 2024-01-17T04:46:31Z Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features Kakuba, Samuel; Poulose, Alwin; Han, Dong Seog (Senior Member, IEEE) Emotion recognition; Spatial features; Temporal features; Semantic tendency features; Multi-head attention Kabale University 2023-02-01T09:00:17Z 2023-02-01T09:00:17Z 2022 Article http://hdl.handle.net/20.500.12493/921 en Attribution-NonCommercial-NoDerivs 3.0 United States http://creativecommons.org/licenses/by-nc-nd/3.0/us/ application/pdf IEEE |
spellingShingle | Emotion recognition; Spatial features; Temporal features; Semantic tendency features; Multi-head attention Kakuba, Samuel; Poulose, Alwin; Han, Dong Seog (Senior Member, IEEE) Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features |
title | Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features |
title_full | Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features |
title_fullStr | Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features |
title_full_unstemmed | Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features |
title_short | Deep Learning-Based Speech Emotion Recognition Using Multi-Level Fusion of Concurrent Features |
title_sort | deep learning based speech emotion recognition using multi level fusion of concurrent features |
topic | Emotion recognition; Spatial features; Temporal features; Semantic tendency features; Multi-head attention |
url | http://hdl.handle.net/20.500.12493/921 |
work_keys_str_mv | AT samuelkakuba deeplearningbasedspeechemotionrecognitionusingmultilevelfusionofconcurrentfeatures AT alwinpoulose deeplearningbasedspeechemotionrecognitionusingmultilevelfusionofconcurrentfeatures AT dongseoghan deeplearningbasedspeechemotionrecognitionusingmultilevelfusionofconcurrentfeatures |
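The abstract reports both weighted and unweighted accuracy, a standard pair of metrics in speech emotion recognition: weighted accuracy is the overall fraction of correct predictions, while unweighted accuracy averages per-class recall so that minority emotion classes count equally. A minimal sketch of the distinction, with hypothetical labels rather than data from the paper:

```python
def weighted_accuracy(y_true, y_pred):
    # Overall fraction of correct predictions; frequent classes dominate.
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def unweighted_accuracy(y_true, y_pred):
    # Mean per-class recall: every emotion class contributes equally.
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

# A classifier that always predicts "happy" looks good on weighted
# accuracy but poor on unweighted accuracy:
y_true = ["happy", "happy", "happy", "sad"]
y_pred = ["happy", "happy", "happy", "happy"]
print(weighted_accuracy(y_true, y_pred))    # 0.75
print(unweighted_accuracy(y_true, y_pred))  # 0.5
```

That the paper's two figures (75.50% vs. 75.82%) are close suggests fairly balanced per-class performance on IEMOCAP.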