Facial Emotion Recognition of 16 Distinct Emotions From Smartphone Videos: Comparative Study of Machine Learning and Human Performance

Bibliographic Details
Main Authors: Marie Keinert, Simon Pistrosch, Adria Mallol-Ragolta, Björn W Schuller, Matthias Berking
Format: Article
Language: English
Published: JMIR Publications, 2025-07-01
Series: Journal of Medical Internet Research
ISSN: 1438-8871
DOI: 10.2196/68942
Online Access: https://www.jmir.org/2025/1/e68942
Description:
Background: The development of automatic emotion recognition models from smartphone videos is a crucial step toward the dissemination of psychotherapeutic app interventions that encourage emotional expressions. Existing models focus mainly on the 6 basic emotions and neglect other therapeutically relevant emotions. To support this research, we introduce the novel Stress Reduction Training Through the Recognition of Emotions Wizard-of-Oz (STREs WoZ) dataset, which contains facial videos of 16 distinct, therapeutically relevant emotions.
Objective: This study aimed to develop deep learning–based automatic facial emotion recognition (FER) models for binary (positive vs negative) and multiclass emotion classification tasks, assess the models' performance, and validate the models by comparing them with human observers.
Methods: The STREs WoZ dataset contains 14,412 facial videos of 63 individuals displaying the 16 emotions. The selfie-style videos were recorded with front-facing smartphone cameras during a stress reduction training in a nonconstrained laboratory setting. Automatic FER models using both appearance and deep-learned features were trained on the STREs WoZ dataset for binary and multiclass emotion classification. The appearance features were based on the Facial Action Coding System and extracted with OpenFace; the deep-learned features were obtained from a ResNet50 model. The models took the appearance features, the deep-learned features, or their concatenation as input and used 1 of 3 recurrent neural network (RNN)–based architectures: an RNN-convolution, an RNN-attention, or an RNN-average network. For validation, 3 human observers were also trained in binary and multiclass emotion recognition. A test set of 3018 facial emotion videos of the 16 emotions was classified by both the automatic FER models and the human observers. Performance was assessed with unweighted average recall (UAR) and accuracy.
Results: In both tasks, models using appearance features outperformed both those using deep-learned features and those combining the two feature types, with the attention network using appearance features emerging as the best-performing model. The attention network achieved a UAR of 92.9% in the binary classification task, and accuracy values ranged from 59.0% to 90.0% in the multiclass classification task. Human performance was comparable to that of the automatic FER model in the binary classification task, with a UAR of 91.0%, and superior in the multiclass classification task, with accuracy values ranging from 87.4% to 99.8%.
Conclusions: Future studies are needed to enhance the performance of automatic FER models for practical use in psychotherapeutic apps. Nevertheless, this study represents an important first step toward advancing emotion-focused psychotherapeutic interventions via smartphone apps.
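The Methods above describe the pipeline only at a high level: per-frame features (OpenFace action units or ResNet50 embeddings), a recurrent network with attention pooling over frames, and UAR as the headline metric. As a reading aid, here is a minimal PyTorch sketch of that general recipe. The GRU, bidirectionality, hidden size, and the 17-dimensional action-unit input are illustrative assumptions, not the authors' published implementation, and `uar` is a hypothetical helper.

```python
import torch
import torch.nn as nn


class RNNAttentionClassifier(nn.Module):
    """Sketch of an RNN-attention network for video-level FER: a GRU encodes
    the per-frame feature sequence, and additive attention pools the hidden
    states into a single video-level vector before classification."""

    def __init__(self, feat_dim: int, hidden_dim: int = 128, n_classes: int = 16):
        super().__init__()
        self.rnn = nn.GRU(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.attn = nn.Linear(2 * hidden_dim, 1)  # one attention score per frame
        self.head = nn.Linear(2 * hidden_dim, n_classes)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, n_frames, feat_dim), e.g., OpenFace action-unit
        # intensities or ResNet50 embeddings extracted per frame.
        h, _ = self.rnn(x)                      # (batch, n_frames, 2 * hidden_dim)
        w = torch.softmax(self.attn(h), dim=1)  # attention weights over frames
        video_vec = (w * h).sum(dim=1)          # weighted sum over the sequence
        return self.head(video_vec)             # logits over the 16 emotions


def uar(y_true: torch.Tensor, y_pred: torch.Tensor, n_classes: int = 16) -> float:
    """Unweighted average recall: the mean of per-class recalls, so each
    emotion counts equally regardless of how often it occurs in the test set."""
    recalls = [
        (y_pred[y_true == c] == c).float().mean()
        for c in range(n_classes)
        if (y_true == c).any()
    ]
    return torch.stack(recalls).mean().item()


# Toy usage: 4 videos of 120 frames with 17 features per frame (illustrative).
model = RNNAttentionClassifier(feat_dim=17)
logits = model(torch.randn(4, 120, 17))
preds = logits.argmax(dim=1)
```

UAR, the mean of per-class recalls, is a sensible headline metric here because the 16 emotion classes need not occur equally often in the test set: unlike plain accuracy, it penalizes a model that ignores rare emotions.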