A Hyper-Attentive Multimodal Transformer for Real-Time and Robust Facial Expression Recognition
Facial expression recognition (FER) plays a critical role in affective computing, enabling machines to interpret human emotions through facial cues. While recent deep learning models have achieved progress, many still fail under real-world conditions such as occlusion, lighting variation, and subtle expressions.
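The record does not include the authors' layer definitions, so as a rough illustration only, the three attention stages named in the abstract (spatial, channel, and cross-patch) can be sketched in NumPy as below. All function names, shapes, and scoring choices here are hypothetical stand-ins, not FERONet's actual implementation:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def spatial_attention(feat):
    # feat: (H, W, C). Weight each spatial location by a score
    # derived from its channel mean (a toy scoring choice).
    score = feat.mean(axis=-1)                                 # (H, W)
    w = softmax(score.reshape(-1)).reshape(score.shape)
    return feat * w[..., None]

def channel_attention(feat):
    # Squeeze-and-excitation style: global average pool, then a
    # sigmoid gate applied per channel.
    squeeze = feat.mean(axis=(0, 1))                           # (C,)
    gate = 1.0 / (1.0 + np.exp(-squeeze))
    return feat * gate

def cross_patch_attention(patches):
    # patches: (N, D) patch tokens; plain scaled dot-product
    # self-attention so every patch attends to every other patch.
    d = patches.shape[-1]
    attn = softmax(patches @ patches.T / np.sqrt(d), axis=-1)  # (N, N)
    return attn @ patches

rng = np.random.default_rng(0)
feat = rng.normal(size=(4, 4, 8))   # toy 4x4 feature map with 8 channels
out = channel_attention(spatial_attention(feat))
tokens = out.reshape(16, 8)         # flatten locations into 16 patch tokens
fused = cross_patch_attention(tokens)
print(fused.shape)                  # prints (16, 8)
```

In a real model each stage would be learned (convolutional score maps, learned gates, multi-head projections) rather than the fixed heuristics used here; the sketch only shows how the three attention types compose over one feature map.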
Saved in:
| Main Authors: | Zarnigor Tagmatova, Sabina Umirzakova, Alpamis Kutlimuratov, Akmalbek Abdusalomov, Young Im Cho |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-06-01 |
| Series: | Applied Sciences |
| Subjects: | facial expression recognition; multimodal transformer; temporal modeling; cross-attention; real-time emotion recognition; human–computer interaction |
| Online Access: | https://www.mdpi.com/2076-3417/15/13/7100 |
| Field | Value |
|---|---|
| _version_ | 1849704574511742976 |
| author | Zarnigor Tagmatova; Sabina Umirzakova; Alpamis Kutlimuratov; Akmalbek Abdusalomov; Young Im Cho |
| author_sort | Zarnigor Tagmatova |
| collection | DOAJ |
| description | Facial expression recognition (FER) plays a critical role in affective computing, enabling machines to interpret human emotions through facial cues. While recent deep learning models have achieved progress, many still fail under real-world conditions such as occlusion, lighting variation, and subtle expressions. In this work, we propose FERONet, a novel hyper-attentive multimodal transformer architecture tailored for robust and real-time FER. FERONet integrates a triple-attention mechanism (spatial, channel, and cross-patch), a hierarchical transformer with token merging for computational efficiency, and a temporal cross-attention decoder to model emotional dynamics in video sequences. The model fuses RGB, optical flow, and depth/landmark inputs, enhancing resilience to environmental variation. Experimental evaluations across five standard FER datasets—FER-2013, RAF-DB, CK+, BU-3DFE, and AFEW—show that FERONet achieves superior recognition accuracy (up to 97.3%) and real-time inference speeds (<16 ms per frame), outperforming prior state-of-the-art models. The results confirm the model’s suitability for deployment in applications such as intelligent tutoring, driver monitoring, and clinical emotion assessment. |
| format | Article |
| id | doaj-art-928a0ddf30e2439b9b0d93e58b8eb735 |
| institution | DOAJ |
| issn | 2076-3417 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Applied Sciences |
| doi | 10.3390/app15137100 |
| citation | Applied Sciences, Vol. 15, Iss. 13, Art. 7100 (2025-06-01) |
| affiliations | Zarnigor Tagmatova, Sabina Umirzakova, Young Im Cho: Department of Computer Engineering, Gachon University, Sujeong-Gu, Seongnam-si 13120, Gyeonggi-Do, Republic of Korea; Alpamis Kutlimuratov: Department of Applied Informatics, Kimyo International University in Tashkent, Tashkent 100121, Uzbekistan; Akmalbek Abdusalomov: Department of Computer Systems, Tashkent University of Information Technologies Named After Muhammad Al-Khwarizmi, Tashkent 100200, Uzbekistan |
| title | A Hyper-Attentive Multimodal Transformer for Real-Time and Robust Facial Expression Recognition |
| title_sort | hyper attentive multimodal transformer for real time and robust facial expression recognition |
| topic | facial expression recognition multimodal transformer temporal modeling cross-attention real-time emotion recognition human–computer interaction |
| url | https://www.mdpi.com/2076-3417/15/13/7100 |