SkelETT—Skeleton-to-Emotion Transfer Transformer
Main Authors: | Pedro Victor Vieira Paiva, Josue Junior Guimaraes Ramos, Marina Gavrilova, Marco Antonio Garcia de Carvalho |
---|---|
Format: | Article |
Language: | English |
Published: | IEEE, 2025-01-01 |
Series: | IEEE Access |
Subjects: | Attention-based design, body emotion recognition, gait analysis, masked autoencoder, affective computing |
Online Access: | https://ieeexplore.ieee.org/document/10852297/ |
_version_ | 1823859610823426048 |
---|---|
author | Pedro Victor Vieira Paiva Josue Junior Guimaraes Ramos Marina Gavrilova Marco Antonio Garcia de Carvalho |
author_facet | Pedro Victor Vieira Paiva Josue Junior Guimaraes Ramos Marina Gavrilova Marco Antonio Garcia de Carvalho |
author_sort | Pedro Victor Vieira Paiva |
collection | DOAJ |
description | Emotion recognition plays an essential role in human-computer interaction, spanning diverse domains from human-robot communication and virtual reality to mental health assessment and affective computing. Traditionally, this field has heavily relied on visual and auditory cues, such as facial expressions and speech analysis. However, these modalities alone may not comprehensively capture the full spectrum of human emotion and suffer limitations due to noise or occlusion. Human skeletons, derived from depth sensors or pose estimation algorithms, offer an alternative to facial expressions, providing valuable spatial and temporal cues. In this paper, we introduce a novel approach to emotion recognition by pre-training a transformer model on a large dataset of human skeleton representations in an unsupervised manner and subsequently fine-tuning it for emotion classification. Exposed to an extensive corpus of unlabeled human skeleton data, the model effectively learns to represent the complex spatial and temporal dependencies inherent in body movements. Following this foundational knowledge acquisition, the model undergoes fine-tuning on a smaller, labeled dataset tailored for emotion classification tasks. We introduce SkelETT, an encoder-only transformer architecture for body emotion recognition. Comprising a series of encoder layers, SkelETT patches 2D body pose representations and combines multi-head self-attention mechanisms with position-wise feed-forward networks, providing a powerful framework for extracting hierarchical features from sequential body pose data. We propose and evaluate the impact of different fine-tuning strategies on pose data using the MPOSE action recognition dataset as a pre-training source. Transfer performance is measured on the BoLD body emotion recognition dataset. Compared to the state-of-the-art, we report significant gains in accuracy (≈34% higher), training time (≈50% less), and model complexity (≈80% fewer trainable parameters). |
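The abstract describes a pipeline of patching a 2D pose sequence into tokens, embedding them, and applying self-attention before classification. The NumPy sketch below illustrates those stages in miniature with random weights; it is a generic illustration of the mechanism, not the authors' SkelETT implementation, and all sizes (17 joints, patch length 5, single attention head, 26 emotion categories) are assumptions chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy skeleton clip: 30 frames, 17 joints, (x, y) per joint (sizes assumed).
T, J, P, D = 30, 17, 5, 32   # frames, joints, patch length, embed dim
poses = rng.normal(size=(T, J, 2))

# 1) Patching: group P consecutive frames into one flat token.
tokens = poses.reshape(T // P, P * J * 2)               # (6, 170)

# 2) Linear embedding plus a (random, stand-in) positional encoding.
W_embed = rng.normal(size=(P * J * 2, D)) * 0.02
x = tokens @ W_embed + rng.normal(size=(T // P, D)) * 0.02   # (6, 32)

# 3) Single-head scaled dot-product self-attention over the tokens.
Wq, Wk, Wv = (rng.normal(size=(D, D)) * 0.02 for _ in range(3))
q, k, v = x @ Wq, x @ Wk, x @ Wv
scores = q @ k.T / np.sqrt(D)
attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
attn /= attn.sum(axis=-1, keepdims=True)                # rows sum to 1
ctx = attn @ v                                          # (6, 32)

# 4) Mean-pool tokens and project to emotion logits
#    (26 categories, as in the BoLD categorical annotation).
W_cls = rng.normal(size=(D, 26)) * 0.02
logits = ctx.mean(axis=0) @ W_cls
print(logits.shape)   # (26,)
```

A real encoder would stack several such attention blocks with feed-forward layers, layer normalization, and learned weights, and would pre-train them with a masked-autoencoder objective before fine-tuning the classification head.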
format | Article |
id | doaj-art-5a6896acbef94353a500b11e3719f10f |
institution | Kabale University |
issn | 2169-3536 |
language | English |
publishDate | 2025-01-01 |
publisher | IEEE |
record_format | Article |
series | IEEE Access |
spelling | doaj-art-5a6896acbef94353a500b11e3719f10f2025-02-11T00:01:31ZengIEEEIEEE Access2169-35362025-01-0113233442335810.1109/ACCESS.2025.353414510852297SkelETT—Skeleton-to-Emotion Transfer TransformerPedro Victor Vieira Paiva0https://orcid.org/0000-0002-3743-1985Josue Junior Guimaraes Ramos1https://orcid.org/0000-0002-5815-2424Marina Gavrilova2https://orcid.org/0000-0002-5338-1834Marco Antonio Garcia de Carvalho3https://orcid.org/0000-0002-1941-6036School of Technology, University of Campinas, Limeira, BrazilRenato Archer IT Center, Campinas, BrazilDepartment of Computer Science, University of Calgary, Calgary, AB, CanadaSchool of Technology, University of Campinas, Limeira, BrazilEmotion recognition plays an essential role in human-computer interaction, spanning diverse domains from human-robot communication and virtual reality to mental health assessment and affective computing. Traditionally, this field has heavily relied on visual and auditory cues, such as facial expressions and speech analysis. However, these modalities alone may not comprehensively capture the full spectrum of human emotion and suffer limitations due to noise or occlusion. Human skeletons, derived from depth sensors or pose estimation algorithms, offer an alternative for facial expression, including valuable spatial and temporal cues. In this paper, we introduce a novel approach to emotion recognition by pre-training a transformer model on a large dataset of unsupervised human skeleton representations and subsequently fine-tuning it for emotion classification. By exposing the model to an extensive corpus of unlabeled human skeleton data, we can effectively learn to represent complex spatial and temporal dependencies inherent in body movements. Following this foundational knowledge acquisition, the model undergoes fine-tuning on a smaller, labeled dataset tailored for emotion classification tasks. We introduce SkelETT, an encoder-only transformer architecture for body emotion recognition. 
Comprising a series of encoder layers, SkelETT patches 2D body pose representations, it also includes multi-head self-attention mechanisms and position-wise feed-forward networks, providing a powerful framework for extracting hierarchical features from sequential body pose data. We propose and evaluate the impact of different fine-tuning strategies on pose data using the MPOSE action recognition dataset as a pre-training source. Transfer performance is measured on the BoLD body emotion recognition dataset. Compared to the state-of-the-art, we report significant gains in accuracy (≈34% higher), training time (≈50% less), and model complexity reduction (≈80% less trainable parameters).https://ieeexplore.ieee.org/document/10852297/Attention-based designbody emotion recognitiongait analysismasked autoencoderaffective computing |
spellingShingle | Pedro Victor Vieira Paiva Josue Junior Guimaraes Ramos Marina Gavrilova Marco Antonio Garcia de Carvalho SkelETT—Skeleton-to-Emotion Transfer Transformer IEEE Access Attention-based design body emotion recognition gait analysis masked autoencoder affective computing |
title | SkelETT—Skeleton-to-Emotion Transfer Transformer |
title_full | SkelETT—Skeleton-to-Emotion Transfer Transformer |
title_fullStr | SkelETT—Skeleton-to-Emotion Transfer Transformer |
title_full_unstemmed | SkelETT—Skeleton-to-Emotion Transfer Transformer |
title_short | SkelETT—Skeleton-to-Emotion Transfer Transformer |
title_sort | skelett skeleton to emotion transfer transformer |
topic | Attention-based design body emotion recognition gait analysis masked autoencoder affective computing |
url | https://ieeexplore.ieee.org/document/10852297/ |
work_keys_str_mv | AT pedrovictorvieirapaiva skelettx2014skeletontoemotiontransfertransformer AT josuejuniorguimaraesramos skelettx2014skeletontoemotiontransfertransformer AT marinagavrilova skelettx2014skeletontoemotiontransfertransformer AT marcoantoniogarciadecarvalho skelettx2014skeletontoemotiontransfertransformer |