Applying a Convolutional Vision Transformer for Emotion Recognition in Children with Autism: Fusion of Facial Expressions and Speech Features


Bibliographic Details
Main Authors: Yonggu Wang, Kailin Pan, Yifan Shao, Jiarong Ma, Xiaojuan Li
Format: Article
Language: English
Published: MDPI AG 2025-03-01
Series: Applied Sciences
Subjects:
Online Access: https://www.mdpi.com/2076-3417/15/6/3083
_version_ 1850204325937152000
author Yonggu Wang
Kailin Pan
Yifan Shao
Jiarong Ma
Xiaojuan Li
author_facet Yonggu Wang
Kailin Pan
Yifan Shao
Jiarong Ma
Xiaojuan Li
author_sort Yonggu Wang
collection DOAJ
description With advances in digital technology, including deep learning and big data analytics, new methods have been developed for autism diagnosis and intervention. Emotion recognition and the detection of autism in children are prominent subjects in autism research. Typically using single-modal data to analyze the emotional states of children with autism, previous research has found that the accuracy of recognition algorithms must be improved. Our study creates datasets on the facial and speech emotions of children with autism in their natural states. A convolutional vision transformer-based emotion recognition model is constructed for the two distinct datasets. The findings indicate that the model achieves accuracies of 79.12% and 83.47% for facial expression recognition and Mel spectrogram recognition, respectively. Consequently, we propose a multimodal data fusion strategy for emotion recognition and construct a feature fusion model based on an attention mechanism, which attains a recognition accuracy of 90.73%. Ultimately, by using gradient-weighted class activation mapping, a prediction heat map is produced to visualize facial expressions and speech features under four emotional states. This study offers a technical direction for the use of intelligent perception technology in the realm of special education and enriches the theory of emotional intelligence perception of children with autism.
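The fusion step described above — combining facial-expression features and Mel-spectrogram speech features through an attention mechanism before classification — can be sketched roughly as follows. This is a minimal NumPy illustration only, not the authors' implementation: the function names, feature dimensions, and the simple per-modality scoring scheme are assumptions made for exposition.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attention_fuse(face_feat, speech_feat, w):
    """Fuse two modality feature vectors with attention weights.

    face_feat, speech_feat: (d,) vectors, e.g. from the two CvT branches
    w: (2, d) scoring weights (hypothetical; learned in a real model)
    Returns the fused (d,) representation and the attention weights.
    """
    stacked = np.stack([face_feat, speech_feat])  # (2, d)
    scores = (w * stacked).sum(axis=1)            # one scalar score per modality
    alpha = softmax(scores)                       # attention weights, sum to 1
    return alpha @ stacked, alpha                 # weighted sum over modalities

rng = np.random.default_rng(0)
d = 8  # toy feature dimension
fused, alpha = attention_fuse(rng.normal(size=d),
                              rng.normal(size=d),
                              rng.normal(size=(2, d)))
```

In this sketch the attention weights let the model lean on whichever modality is more informative for a given sample — the intuition behind attention-based fusion generally — while the actual model in the article learns these weights end to end.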
format Article
id doaj-art-2d6f5dfb425848309f68a6d7749dfb02
institution OA Journals
issn 2076-3417
language English
publishDate 2025-03-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj-art-2d6f5dfb425848309f68a6d7749dfb022025-08-20T02:11:18ZengMDPI AGApplied Sciences2076-34172025-03-01156308310.3390/app15063083Applying a Convolutional Vision Transformer for Emotion Recognition in Children with Autism: Fusion of Facial Expressions and Speech FeaturesYonggu Wang0Kailin Pan1Yifan Shao2Jiarong Ma3Xiaojuan Li4College of Education, Zhejiang University of Technology, Hangzhou 310023, ChinaCollege of Education, Zhejiang University of Technology, Hangzhou 310023, ChinaCollege of Education, Zhejiang University of Technology, Hangzhou 310023, ChinaCollege of Education, Zhejiang University of Technology, Hangzhou 310023, ChinaMental Health Education Centre, Zhejiang University of Finance and Economics, Hangzhou 310018, ChinaWith advances in digital technology, including deep learning and big data analytics, new methods have been developed for autism diagnosis and intervention. Emotion recognition and the detection of autism in children are prominent subjects in autism research. Typically using single-modal data to analyze the emotional states of children with autism, previous research has found that the accuracy of recognition algorithms must be improved. Our study creates datasets on the facial and speech emotions of children with autism in their natural states. A convolutional vision transformer-based emotion recognition model is constructed for the two distinct datasets. The findings indicate that the model achieves accuracies of 79.12% and 83.47% for facial expression recognition and Mel spectrogram recognition, respectively. Consequently, we propose a multimodal data fusion strategy for emotion recognition and construct a feature fusion model based on an attention mechanism, which attains a recognition accuracy of 90.73%. Ultimately, by using gradient-weighted class activation mapping, a prediction heat map is produced to visualize facial expressions and speech features under four emotional states. 
This study offers a technical direction for the use of intelligent perception technology in the realm of special education and enriches the theory of emotional intelligence perception of children with autism.https://www.mdpi.com/2076-3417/15/6/3083emotion recognitionmultimodal feature fusiondeep learningchildren with autism
spellingShingle Yonggu Wang
Kailin Pan
Yifan Shao
Jiarong Ma
Xiaojuan Li
Applying a Convolutional Vision Transformer for Emotion Recognition in Children with Autism: Fusion of Facial Expressions and Speech Features
Applied Sciences
emotion recognition
multimodal feature fusion
deep learning
children with autism
title Applying a Convolutional Vision Transformer for Emotion Recognition in Children with Autism: Fusion of Facial Expressions and Speech Features
title_full Applying a Convolutional Vision Transformer for Emotion Recognition in Children with Autism: Fusion of Facial Expressions and Speech Features
title_fullStr Applying a Convolutional Vision Transformer for Emotion Recognition in Children with Autism: Fusion of Facial Expressions and Speech Features
title_full_unstemmed Applying a Convolutional Vision Transformer for Emotion Recognition in Children with Autism: Fusion of Facial Expressions and Speech Features
title_short Applying a Convolutional Vision Transformer for Emotion Recognition in Children with Autism: Fusion of Facial Expressions and Speech Features
title_sort applying a convolutional vision transformer for emotion recognition in children with autism fusion of facial expressions and speech features
topic emotion recognition
multimodal feature fusion
deep learning
children with autism
url https://www.mdpi.com/2076-3417/15/6/3083
work_keys_str_mv AT yongguwang applyingaconvolutionalvisiontransformerforemotionrecognitioninchildrenwithautismfusionoffacialexpressionsandspeechfeatures
AT kailinpan applyingaconvolutionalvisiontransformerforemotionrecognitioninchildrenwithautismfusionoffacialexpressionsandspeechfeatures
AT yifanshao applyingaconvolutionalvisiontransformerforemotionrecognitioninchildrenwithautismfusionoffacialexpressionsandspeechfeatures
AT jiarongma applyingaconvolutionalvisiontransformerforemotionrecognitioninchildrenwithautismfusionoffacialexpressionsandspeechfeatures
AT xiaojuanli applyingaconvolutionalvisiontransformerforemotionrecognitioninchildrenwithautismfusionoffacialexpressionsandspeechfeatures