A Hyper-Attentive Multimodal Transformer for Real-Time and Robust Facial Expression Recognition


Saved in:
Bibliographic Details
Main Authors: Zarnigor Tagmatova, Sabina Umirzakova, Alpamis Kutlimuratov, Akmalbek Abdusalomov, Young Im Cho
Format: Article
Language:English
Published: MDPI AG 2025-06-01
Series:Applied Sciences
Subjects: facial expression recognition; multimodal transformer; temporal modeling; cross-attention; real-time emotion recognition; human–computer interaction
Online Access:https://www.mdpi.com/2076-3417/15/13/7100
author Zarnigor Tagmatova
Sabina Umirzakova
Alpamis Kutlimuratov
Akmalbek Abdusalomov
Young Im Cho
collection DOAJ
description Facial expression recognition (FER) plays a critical role in affective computing, enabling machines to interpret human emotions through facial cues. While recent deep learning models have achieved progress, many still fail under real-world conditions such as occlusion, lighting variation, and subtle expressions. In this work, we propose FERONet, a novel hyper-attentive multimodal transformer architecture tailored for robust and real-time FER. FERONet integrates a triple-attention mechanism (spatial, channel, and cross-patch), a hierarchical transformer with token merging for computational efficiency, and a temporal cross-attention decoder to model emotional dynamics in video sequences. The model fuses RGB, optical flow, and depth/landmark inputs, enhancing resilience to environmental variation. Experimental evaluations across five standard FER datasets—FER-2013, RAF-DB, CK+, BU-3DFE, and AFEW—show that FERONet achieves superior recognition accuracy (up to 97.3%) and real-time inference speeds (<16 ms per frame), outperforming prior state-of-the-art models. The results confirm the model’s suitability for deployment in applications such as intelligent tutoring, driver monitoring, and clinical emotion assessment.
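The abstract credits part of FERONet's real-time speed to token merging in its hierarchical transformer. The paper's actual merging rule is not given in this record, so the following is only a toy pure-Python sketch of the general idea, shrinking a token sequence by averaging the most similar adjacent pairs (the greedy pairing heuristic and the function `merge_tokens` are illustrative assumptions, not FERONet's algorithm):

```python
import math

def cosine(a, b):
    # Cosine similarity between two equal-length vectors (0.0 for zero vectors).
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def merge_tokens(tokens, r):
    """Greedily merge the r most similar adjacent token pairs by averaging.

    Illustrative only: real transformer token-merging schemes (and whatever
    FERONet uses) differ in pairing strategy and weighting; this shows just
    the core trick of reducing sequence length to cut attention cost.
    """
    tokens = [list(t) for t in tokens]
    for _ in range(r):
        if len(tokens) < 2:
            break
        # Pick the adjacent pair with the highest cosine similarity.
        i = max(range(len(tokens) - 1),
                key=lambda j: cosine(tokens[j], tokens[j + 1]))
        merged = [(a + b) / 2 for a, b in zip(tokens[i], tokens[i + 1])]
        tokens[i:i + 2] = [merged]
    return tokens

toks = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
reduced = merge_tokens(toks, 2)
print(len(reduced))  # sequence shortened from 4 tokens to 2
```

Since self-attention cost grows quadratically with sequence length, halving the token count as above roughly quarters the attention compute, which is the efficiency lever the abstract refers to.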
format Article
id doaj-art-928a0ddf30e2439b9b0d93e58b8eb735
institution DOAJ
issn 2076-3417
language English
publishDate 2025-06-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling Applied Sciences, Vol. 15, Iss. 13, Art. 7100 (2025-06-01). DOI: 10.3390/app15137100
Affiliations:
Zarnigor Tagmatova, Sabina Umirzakova, Young Im Cho: Department of Computer Engineering, Gachon University, Sujeong-Gu, Seongnam-si 13120, Gyeonggi-Do, Republic of Korea
Alpamis Kutlimuratov: Department of Applied Informatics, Kimyo International University in Tashkent, Tashkent 100121, Uzbekistan
Akmalbek Abdusalomov: Department of Computer Systems, Tashkent University of Information Technologies Named After Muhammad Al-Khwarizmi, Tashkent 100200, Uzbekistan
title A Hyper-Attentive Multimodal Transformer for Real-Time and Robust Facial Expression Recognition
topic facial expression recognition
multimodal transformer
temporal modeling
cross-attention
real-time emotion recognition
human–computer interaction
url https://www.mdpi.com/2076-3417/15/13/7100