SkelETT: Skeleton-to-Emotion Transfer Transformer

Bibliographic Details
Main Authors: Pedro Victor Vieira Paiva, Josue Junior Guimaraes Ramos, Marina Gavrilova, Marco Antonio Garcia de Carvalho
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access:https://ieeexplore.ieee.org/document/10852297/
Description
Summary: Emotion recognition plays an essential role in human-computer interaction, spanning diverse domains from human-robot communication and virtual reality to mental health assessment and affective computing. Traditionally, this field has relied heavily on visual and auditory cues, such as facial expressions and speech analysis. However, these modalities alone may not comprehensively capture the full spectrum of human emotion, and they suffer limitations due to noise or occlusion. Human skeletons, derived from depth sensors or pose estimation algorithms, offer an alternative to facial expressions, providing valuable spatial and temporal cues. In this paper, we introduce a novel approach to emotion recognition by pre-training a transformer model on a large dataset of unsupervised human skeleton representations and subsequently fine-tuning it for emotion classification. By exposing the model to an extensive corpus of unlabeled human skeleton data, we can effectively learn to represent the complex spatial and temporal dependencies inherent in body movements. Following this foundational knowledge acquisition, the model undergoes fine-tuning on a smaller, labeled dataset tailored for emotion classification tasks. We introduce SkelETT, an encoder-only transformer architecture for body emotion recognition. Comprising a series of encoder layers, SkelETT patches 2D body pose representations and includes multi-head self-attention mechanisms and position-wise feed-forward networks, providing a powerful framework for extracting hierarchical features from sequential body pose data. We propose and evaluate the impact of different fine-tuning strategies on pose data, using the MPOSE action recognition dataset as a pre-training source. Transfer performance is measured on the BoLD body emotion recognition dataset.
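The abstract states that SkelETT "patches" 2D body pose representations before feeding them to the encoder. As a rough illustration of what such a patching step could look like, the sketch below groups consecutive pose frames into fixed-length tokens; the function name, patch length, and joint count are assumptions for illustration, not details taken from the paper.

```python
# Hypothetical sketch: turning a sequence of 2D poses into patch tokens,
# in the spirit of the patching step the abstract describes.

def patch_pose_sequence(frames, patch_len=4):
    """Group consecutive pose frames into fixed-length patches (tokens).

    frames: list of per-frame poses, each a flat list of 2*J joint coords.
    Returns a list of patches; each patch concatenates `patch_len`
    consecutive frames into one flat vector (one transformer input token).
    """
    # drop trailing frames that do not fill a whole patch
    usable = len(frames) - len(frames) % patch_len
    return [
        [coord for frame in frames[i:i + patch_len] for coord in frame]
        for i in range(0, usable, patch_len)
    ]

# toy example: 10 frames, 3 joints with (x, y) coordinates each
frames = [[float(t)] * 6 for t in range(10)]
tokens = patch_pose_sequence(frames, patch_len=4)
print(len(tokens), len(tokens[0]))  # 2 patches, each 4*6 = 24 values
```

In a real implementation, each patch vector would then be linearly projected to the model dimension and combined with positional encodings before entering the multi-head self-attention layers.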
Compared to the state-of-the-art, we report significant gains in accuracy (≈34% higher), training time (≈50% less), and model complexity (≈80% fewer trainable parameters).
ISSN: 2169-3536