SkelETT—Skeleton-to-Emotion Transfer Transformer


Bibliographic Details
Main Authors: Pedro Victor Vieira Paiva, Josue Junior Guimaraes Ramos, Marina Gavrilova, Marco Antonio Garcia de Carvalho
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10852297/
author Pedro Victor Vieira Paiva
Josue Junior Guimaraes Ramos
Marina Gavrilova
Marco Antonio Garcia de Carvalho
author_facet Pedro Victor Vieira Paiva
Josue Junior Guimaraes Ramos
Marina Gavrilova
Marco Antonio Garcia de Carvalho
author_sort Pedro Victor Vieira Paiva
collection DOAJ
description Emotion recognition plays an essential role in human-computer interaction, spanning diverse domains from human-robot communication and virtual reality to mental health assessment and affective computing. Traditionally, this field has relied heavily on visual and auditory cues, such as facial expressions and speech analysis. However, these modalities alone may not capture the full spectrum of human emotion, and they suffer limitations due to noise or occlusion. Human skeletons, derived from depth sensors or pose estimation algorithms, offer an alternative to facial expressions, carrying valuable spatial and temporal cues. In this paper, we introduce a novel approach to emotion recognition: pre-training a transformer model on a large unlabeled corpus of human skeleton representations and subsequently fine-tuning it for emotion classification. By exposing the model to extensive unlabeled human skeleton data, we effectively learn to represent the complex spatial and temporal dependencies inherent in body movements. Following this foundational knowledge acquisition, the model undergoes fine-tuning on a smaller, labeled dataset tailored for emotion classification. We introduce SkelETT, an encoder-only transformer architecture for body emotion recognition. SkelETT patches 2D body pose representations and comprises a series of encoder layers with multi-head self-attention mechanisms and position-wise feed-forward networks, providing a powerful framework for extracting hierarchical features from sequential body pose data. We propose and evaluate the impact of different fine-tuning strategies on pose data, using the MPOSE action recognition dataset as a pre-training source. Transfer performance is measured on the BoLD body emotion recognition dataset.
Compared to the state-of-the-art, we report significant gains in accuracy (≈34% higher), training time (≈50% less), and model complexity (≈80% fewer trainable parameters).
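The abstract's core architectural idea, grouping frames of a 2D pose sequence into flattened "patch" tokens and feeding them through a self-attention encoder before a classification head, can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the patch length, 17-joint skeleton, model dimension, single-head attention, and 26-way emotion head (the BoLD category count) are all assumptions chosen for demonstration.

```python
import numpy as np

def patchify(poses, patch_len):
    # poses: (T, J, 2) sequence of T frames, J joints, (x, y) coordinates.
    # Group patch_len consecutive frames into one flattened token.
    T, J, C = poses.shape
    n = T // patch_len
    return poses[: n * patch_len].reshape(n, patch_len * J * C)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Single-head scaled dot-product attention over the patch tokens.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
T, J, d = 30, 17, 64            # 30 frames, 17 COCO-style joints, model dim 64 (assumed)
poses = rng.standard_normal((T, J, 2))

patches = patchify(poses, patch_len=5)         # (6, 170): six tokens of 5*17*2 values
W_embed = rng.standard_normal((patches.shape[1], d)) * 0.02
pos = rng.standard_normal((patches.shape[0], d)) * 0.02
tokens = patches @ W_embed + pos               # linear patch embedding + position embedding

Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
encoded = self_attention(tokens, Wq, Wk, Wv)   # one encoder attention step

# Mean-pool the tokens and apply a linear head for emotion logits
# (26 outputs, matching BoLD's categorical emotion labels).
W_head = rng.standard_normal((d, 26)) * 0.02
logits = encoded.mean(axis=0) @ W_head
print(logits.shape)  # (26,)
```

In the full model, the attention step would be stacked into several encoder layers with multi-head attention and position-wise feed-forward sublayers; for transfer, the embedding and encoder weights would come from pre-training on unlabeled skeleton data, with only some layers and the head updated during fine-tuning.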
format Article
id doaj-art-5a6896acbef94353a500b11e3719f10f
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-5a6896acbef94353a500b11e3719f10f 2025-02-11T00:01:31Z eng IEEE, IEEE Access, 2169-3536, 2025-01-01, vol. 13, pp. 23344-23358, doi:10.1109/ACCESS.2025.3534145, 10852297
SkelETT—Skeleton-to-Emotion Transfer Transformer
Pedro Victor Vieira Paiva (https://orcid.org/0000-0002-3743-1985), School of Technology, University of Campinas, Limeira, Brazil
Josue Junior Guimaraes Ramos (https://orcid.org/0000-0002-5815-2424), Renato Archer IT Center, Campinas, Brazil
Marina Gavrilova (https://orcid.org/0000-0002-5338-1834), Department of Computer Science, University of Calgary, Calgary, AB, Canada
Marco Antonio Garcia de Carvalho (https://orcid.org/0000-0002-1941-6036), School of Technology, University of Campinas, Limeira, Brazil
Keywords: Attention-based design; body emotion recognition; gait analysis; masked autoencoder; affective computing
https://ieeexplore.ieee.org/document/10852297/
spellingShingle Pedro Victor Vieira Paiva
Josue Junior Guimaraes Ramos
Marina Gavrilova
Marco Antonio Garcia de Carvalho
SkelETT—Skeleton-to-Emotion Transfer Transformer
IEEE Access
Attention-based design
body emotion recognition
gait analysis
masked autoencoder
affective computing
title SkelETT—Skeleton-to-Emotion Transfer Transformer
title_full SkelETT—Skeleton-to-Emotion Transfer Transformer
title_fullStr SkelETT—Skeleton-to-Emotion Transfer Transformer
title_full_unstemmed SkelETT—Skeleton-to-Emotion Transfer Transformer
title_short SkelETT—Skeleton-to-Emotion Transfer Transformer
title_sort skelett skeleton to emotion transfer transformer
topic Attention-based design
body emotion recognition
gait analysis
masked autoencoder
affective computing
url https://ieeexplore.ieee.org/document/10852297/
work_keys_str_mv AT pedrovictorvieirapaiva skelettx2014skeletontoemotiontransfertransformer
AT josuejuniorguimaraesramos skelettx2014skeletontoemotiontransfertransformer
AT marinagavrilova skelettx2014skeletontoemotiontransfertransformer
AT marcoantoniogarciadecarvalho skelettx2014skeletontoemotiontransfertransformer