Attention-Based Multi-Learning Approach for Speech Emotion Recognition With Dilated Convolution
Main Authors: | Samuel, Kakuba; Alwin, Poulose |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2023 |
Subjects: | Emotion recognition; Residual dilated causal convolution; Multi-head attention |
Online Access: | http://hdl.handle.net/20.500.12493/920 |
author | Samuel, Kakuba; Alwin, Poulose
author_facet | Samuel, Kakuba; Alwin, Poulose
author_sort | Samuel, Kakuba |
collection | KAB-DR |
description | The success of deep learning in speech emotion recognition has led to its application
on resource-constrained devices. It has been applied in human-to-machine interaction applications
such as social living assistance, authentication, health monitoring and alertness systems. To ensure
a good user experience, robust, accurate and computationally efficient deep learning models are
necessary. Recurrent neural networks (RNNs) such as long short-term memory (LSTM), gated recurrent
units (GRU) and their variants, which operate sequentially, are often used to learn the time series
of the signal and to analyze long-term dependencies and the contexts of utterances in the speech
signal. However, because of their sequential operation, they converge slowly, train sluggishly,
consume large amounts of memory and suffer from the vanishing gradient problem. In addition, they
do not consider spatial cues that may exist in the speech signal. Therefore, we propose an
attention-based multi-learning model (ABMD) that uses residual dilated causal convolution (RDCC)
blocks and dilated convolution (DC) layers with multi-head attention. The proposed ABMD model
achieves comparable performance while capturing global contextualized long-term dependencies
between features in parallel, using a large receptive field whose parameter count grows only
modestly with the number of layers, and it considers spatial cues among the speech features.
Spectral and voice quality features extracted from the raw speech signals are used as inputs.
The proposed ABMD model obtained recognition accuracies and F1 scores of 93.75% and 92.50% on the
SAVEE dataset, 85.89% and 85.34% on the RAVDESS dataset, and 95.93% and 95.83% on the EMODB
dataset. The model's robustness in terms of the confusion ratio of the individual discrete
emotions, especially happiness, which is often confused with emotions in the same dimensional
plane, also improved when validated on the same datasets. |
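To make the architecture named in the abstract concrete, the following is a minimal PyTorch sketch of RDCC blocks feeding multi-head attention. It is an illustration under stated assumptions, not the paper's implementation: the channel width (64), kernel size (3), dilation schedule (1, 2, 4, 8) and head count (4) are placeholders, since this record does not give the actual hyperparameters.

```python
# Minimal sketch: residual dilated causal convolution (RDCC) blocks feeding
# multi-head attention. All sizes below are illustrative assumptions.
import torch
import torch.nn as nn

class CausalConv1d(nn.Module):
    """1-D convolution that only sees past frames (left-sided padding)."""
    def __init__(self, channels, kernel_size, dilation):
        super().__init__()
        self.pad = (kernel_size - 1) * dilation
        self.conv = nn.Conv1d(channels, channels, kernel_size,
                              padding=self.pad, dilation=dilation)

    def forward(self, x):                        # x: (batch, channels, time)
        out = self.conv(x)
        return out[..., :-self.pad] if self.pad else out  # trim lookahead

class RDCCBlock(nn.Module):
    """Two dilated causal convolutions with a residual connection."""
    def __init__(self, channels, kernel_size=3, dilation=1):
        super().__init__()
        self.net = nn.Sequential(
            CausalConv1d(channels, kernel_size, dilation), nn.ReLU(),
            CausalConv1d(channels, kernel_size, dilation), nn.ReLU(),
        )

    def forward(self, x):
        return x + self.net(x)                   # residual add aids gradient flow

# Doubling the dilation per block grows the receptive field exponentially
# with depth while the parameter count grows only linearly.
blocks = nn.Sequential(*[RDCCBlock(64, dilation=2 ** i) for i in range(4)])
attn = nn.MultiheadAttention(embed_dim=64, num_heads=4, batch_first=True)

x = torch.randn(8, 64, 200)                      # (batch, features, frames)
h = blocks(x).transpose(1, 2)                    # -> (batch, frames, features)
ctx, _ = attn(h, h, h)                           # global context, in parallel
```

Unlike an LSTM or GRU, every frame here is processed in parallel; the attention step then supplies the global, contextualized long-term dependencies the abstract refers to.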
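On the input side, the abstract names spectral and voice quality features extracted from raw speech. A hedged sketch using librosa is shown below: MFCCs and spectral centroid stand in for the spectral features, while zero-crossing rate and RMS energy serve as rough voice-quality proxies. The paper's exact feature set is not specified in this record, and speech.wav is a hypothetical input file.

```python
# Hedged feature-extraction sketch: the exact features used by the paper are
# not listed in this record, so the choices below are illustrative.
import numpy as np
import librosa

def extract_features(path, sr=16000, n_mfcc=40):
    y, _ = librosa.load(path, sr=sr)                          # raw speech
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # spectral shape
    centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # brightness
    zcr = librosa.feature.zero_crossing_rate(y)               # voicing proxy
    rms = librosa.feature.rms(y=y)                            # frame energy
    # Stack per-frame features into one (n_features, n_frames) matrix,
    # matching the (features, frames) layout the model sketch expects.
    return np.vstack([mfcc, centroid, zcr, rms])

feats = extract_features("speech.wav")   # hypothetical file path
print(feats.shape)                       # (n_mfcc + 3, n_frames)
```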
format | Article |
id | oai:idr.kab.ac.ug:20.500.12493-920 |
institution | KAB-DR |
language | English |
publishDate | 2023 |
publisher | IEEE |
record_format | dspace |
title | Attention-Based Multi-Learning Approach for Speech Emotion Recognition With Dilated Convolution |
topic | Emotion recognition; Residual dilated causal convolution; Multi-head attention
url | http://hdl.handle.net/20.500.12493/920 |