SkelETT—Skeleton-to-Emotion Transfer Transformer


Bibliographic Details
Main Authors: Pedro Victor Vieira Paiva, Josue Junior Guimaraes Ramos, Marina Gavrilova, Marco Antonio Garcia de Carvalho
Format: Article
Language:English
Published: IEEE 2025-01-01
Series:IEEE Access
Subjects:
Online Access:https://ieeexplore.ieee.org/document/10852297/
author Pedro Victor Vieira Paiva
Josue Junior Guimaraes Ramos
Marina Gavrilova
Marco Antonio Garcia de Carvalho
author_facet Pedro Victor Vieira Paiva
Josue Junior Guimaraes Ramos
Marina Gavrilova
Marco Antonio Garcia de Carvalho
author_sort Pedro Victor Vieira Paiva
collection DOAJ
description Emotion recognition plays an essential role in human-computer interaction, spanning diverse domains from human-robot communication and virtual reality to mental health assessment and affective computing. Traditionally, this field has relied heavily on visual and auditory cues, such as facial expressions and speech analysis. However, these modalities alone may not capture the full spectrum of human emotion, and they suffer limitations due to noise or occlusion. Human skeletons, derived from depth sensors or pose estimation algorithms, offer an alternative to facial expressions, carrying valuable spatial and temporal cues. In this paper, we introduce a novel approach to emotion recognition: pre-training a transformer model on a large unlabeled corpus of human skeleton representations and subsequently fine-tuning it for emotion classification. By exposing the model to extensive unlabeled human skeleton data, we effectively learn to represent the complex spatial and temporal dependencies inherent in body movements. Following this foundational knowledge acquisition, the model undergoes fine-tuning on a smaller, labeled dataset tailored for emotion classification. We introduce SkelETT, an encoder-only transformer architecture for body emotion recognition. SkelETT patches 2D body pose representations and comprises a series of encoder layers with multi-head self-attention mechanisms and position-wise feed-forward networks, providing a powerful framework for extracting hierarchical features from sequential body pose data. We propose and evaluate the impact of different fine-tuning strategies on pose data, using the MPOSE action recognition dataset as a pre-training source. Transfer performance is measured on the BoLD body emotion recognition dataset.
Compared to the state-of-the-art, we report significant gains in accuracy (≈34% higher), training time (≈50% less), and model complexity (≈80% fewer trainable parameters).
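The abstract's core architectural idea, grouping frames of a 2D pose sequence into flattened "patch" tokens and feeding them through a self-attention encoder before a classification head, can be illustrated with a minimal numpy sketch. This is not the authors' implementation: the patch length, 17-joint skeleton, model dimension, single-head attention, and 26-way emotion head (the BoLD category count) are all assumptions chosen for demonstration.

```python
import numpy as np

def patchify(poses, patch_len):
    # poses: (T, J, 2) sequence of T frames, J joints, (x, y) coordinates.
    # Group patch_len consecutive frames into one flattened token.
    T, J, C = poses.shape
    n = T // patch_len
    return poses[: n * patch_len].reshape(n, patch_len * J * C)

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def self_attention(tokens, Wq, Wk, Wv):
    # Single-head scaled dot-product attention over the patch tokens.
    q, k, v = tokens @ Wq, tokens @ Wk, tokens @ Wv
    scores = softmax(q @ k.T / np.sqrt(k.shape[-1]))
    return scores @ v

rng = np.random.default_rng(0)
T, J, d = 30, 17, 64            # 30 frames, 17 COCO-style joints, model dim 64 (assumed)
poses = rng.standard_normal((T, J, 2))

patches = patchify(poses, patch_len=5)         # (6, 170): six tokens of 5*17*2 values
W_embed = rng.standard_normal((patches.shape[1], d)) * 0.02
pos = rng.standard_normal((patches.shape[0], d)) * 0.02
tokens = patches @ W_embed + pos               # linear patch embedding + position embedding

Wq, Wk, Wv = (rng.standard_normal((d, d)) * 0.02 for _ in range(3))
encoded = self_attention(tokens, Wq, Wk, Wv)   # one encoder attention step

# Mean-pool the tokens and apply a linear head for emotion logits
# (26 outputs, matching BoLD's categorical emotion labels).
W_head = rng.standard_normal((d, 26)) * 0.02
logits = encoded.mean(axis=0) @ W_head
print(logits.shape)  # (26,)
```

In the full model, the attention step would be stacked into several encoder layers with multi-head attention and position-wise feed-forward sublayers; for transfer, the embedding and encoder weights would come from pre-training on unlabeled skeleton data, with only some layers and the head updated during fine-tuning.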
format Article
id doaj-art-5a6896acbef94353a500b11e3719f10f
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-5a6896acbef94353a500b11e3719f10f 2025-02-11T00:01:31Z eng IEEE, IEEE Access, 2169-3536, 2025-01-01, vol. 13, pp. 23344-23358, doi:10.1109/ACCESS.2025.3534145, 10852297
SkelETT—Skeleton-to-Emotion Transfer Transformer
Pedro Victor Vieira Paiva (https://orcid.org/0000-0002-3743-1985), School of Technology, University of Campinas, Limeira, Brazil
Josue Junior Guimaraes Ramos (https://orcid.org/0000-0002-5815-2424), Renato Archer IT Center, Campinas, Brazil
Marina Gavrilova (https://orcid.org/0000-0002-5338-1834), Department of Computer Science, University of Calgary, Calgary, AB, Canada
Marco Antonio Garcia de Carvalho (https://orcid.org/0000-0002-1941-6036), School of Technology, University of Campinas, Limeira, Brazil
Keywords: Attention-based design; body emotion recognition; gait analysis; masked autoencoder; affective computing
https://ieeexplore.ieee.org/document/10852297/
spellingShingle Pedro Victor Vieira Paiva
Josue Junior Guimaraes Ramos
Marina Gavrilova
Marco Antonio Garcia de Carvalho
SkelETT—Skeleton-to-Emotion Transfer Transformer
IEEE Access
Attention-based design
body emotion recognition
gait analysis
masked autoencoder
affective computing
title SkelETT—Skeleton-to-Emotion Transfer Transformer
title_full SkelETT—Skeleton-to-Emotion Transfer Transformer
title_fullStr SkelETT—Skeleton-to-Emotion Transfer Transformer
title_full_unstemmed SkelETT—Skeleton-to-Emotion Transfer Transformer
title_short SkelETT—Skeleton-to-Emotion Transfer Transformer
title_sort skelett skeleton to emotion transfer transformer
topic Attention-based design
body emotion recognition
gait analysis
masked autoencoder
affective computing
url https://ieeexplore.ieee.org/document/10852297/
work_keys_str_mv AT pedrovictorvieirapaiva skelettx2014skeletontoemotiontransfertransformer
AT josuejuniorguimaraesramos skelettx2014skeletontoemotiontransfertransformer
AT marinagavrilova skelettx2014skeletontoemotiontransfertransformer
AT marcoantoniogarciadecarvalho skelettx2014skeletontoemotiontransfertransformer