EFFResNet-ViT: A Fusion-Based Convolutional and Vision Transformer Model for Explainable Medical Image Classification

The rapid advancement of medical imaging technologies requires the development of advanced, automated, and interpretable diagnostic tools for clinical decision-making. Although convolutional neural networks (CNNs) have shown significant promise in medical image analysis, they have limitations in cap...

Bibliographic Details
Main Authors:	Tahir Hussain, Hayaru Shouno, Abid Hussain, Dostdar Hussain, Muhammad Ismail, Tatheer Hussain Mir, Fang Rong Hsu, Taukir Alam, Shabnur Anonna Akhy
Format:	Article
Language:	English
Published:	IEEE 2025-01-01
Series:	IEEE Access
Subjects:	EfficientNet-B0; ResNet-50; vision transformer; model explainability; t-distributed stochastic neighbor embedding
Online Access:	https://ieeexplore.ieee.org/document/10938132/

author	Tahir Hussain, Hayaru Shouno, Abid Hussain, Dostdar Hussain, Muhammad Ismail, Tatheer Hussain Mir, Fang Rong Hsu, Taukir Alam, Shabnur Anonna Akhy
author_sort	Tahir Hussain
collection	DOAJ
description	The rapid advancement of medical imaging technologies requires the development of advanced, automated, and interpretable diagnostic tools for clinical decision-making. Although convolutional neural networks (CNNs) have shown significant promise in medical image analysis, they have limitations in capturing the global context and lack interpretability, thereby hindering their clinical adoption. This study presents EFFResNet-ViT, a novel hybrid deep learning (DL) model designed to address these challenges by combining EfficientNet-B0 and ResNet-50 CNN backbones with a vision transformer (ViT) module. The proposed architecture employs a feature fusion strategy to integrate the local feature extraction strengths of CNNs with the global dependency modeling capabilities of transformers. The extracted features are further refined through a post-transformer CNN and a global average pooling layer to enhance the classification performance. To improve interpretability, EFFResNet-ViT incorporates Grad-CAM visualization techniques to highlight regions contributing to classification decisions and employs t-distributed stochastic neighbor embedding for feature space analysis, providing insights into class separability. The proposed model was evaluated on two benchmark datasets: brain tumor (BT) CE-MRI for BT classification and a retinal image dataset for ophthalmological diagnosis. EFFResNet-ViT achieved state-of-the-art performance, with accuracies of 99.31% and 92.54% on the BT CE-MRI and retinal datasets, respectively. Comparative analyses demonstrate the superior classification performance and interpretability of EFFResNet-ViT over existing ViT and CNN-based hybrid models. The explainable design of EFFResNet-ViT addresses the critical need for transparency in artificial intelligence-driven medical diagnostics, facilitating its potential integration into clinical workflows to improve decision-making and patient outcomes.
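The pipeline named in the description (two CNN backbones, feature fusion, a ViT module, a post-transformer CNN, global average pooling) can be followed as tensor-shape bookkeeping. The sketch below is illustrative only and is not the authors' implementation: the record does not state the fusion operator, input resolution, or transformer width, so channel concatenation, a 224 x 224 input, and an embedding width of 768 are assumptions, and every function name is hypothetical. The channel counts 1280 and 2048 are the standard final feature widths of EfficientNet-B0 and ResNet-50.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class FeatureMap:
    channels: int
    height: int
    width: int


def backbone_output(channels: int, input_size: int, stride: int = 32) -> FeatureMap:
    """Final feature map of a stride-32 CNN backbone for a square input."""
    side = input_size // stride
    return FeatureMap(channels, side, side)


def fuse_by_concat(a: FeatureMap, b: FeatureMap) -> FeatureMap:
    """Channel-wise concatenation (ASSUMED fusion operator)."""
    if (a.height, a.width) != (b.height, b.width):
        raise ValueError("spatial sizes must match before concatenation")
    return FeatureMap(a.channels + b.channels, a.height, a.width)


def to_tokens(fm: FeatureMap, embed_dim: int) -> Tuple[int, int]:
    """One token per spatial position, linearly projected to embed_dim."""
    return (fm.height * fm.width, embed_dim)


def pooled_length(fm: FeatureMap) -> int:
    """Global average pooling leaves one value per channel."""
    return fm.channels


INPUT_SIZE = 224  # assumed input resolution
EMBED_DIM = 768   # assumed transformer width

effnet = backbone_output(1280, INPUT_SIZE)   # EfficientNet-B0 features
resnet = backbone_output(2048, INPUT_SIZE)   # ResNet-50 features
fused = fuse_by_concat(effnet, resnet)
tokens = to_tokens(fused, EMBED_DIM)

# Assumed: transformer output is reshaped to the same grid before the
# post-transformer CNN, which here keeps the channel count unchanged.
post_cnn = FeatureMap(EMBED_DIM, fused.height, fused.width)
vector_len = pooled_length(post_cnn)

print(fused, tokens, vector_len)
```

Under these assumptions both backbones yield a 7 x 7 grid, so the fused map has 3328 channels and the transformer sees 49 tokens; a different fusion operator (for example, addition after projection) would change the channel count but not the token count.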
format	Article
id	doaj-art-7b66f8d79854442bbe7e55130d772b97
institution	DOAJ
issn	2169-3536
language	English
publishDate	2025-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-7b66f8d79854442bbe7e55130d772b97; 2025-08-20T03:16:42Z; eng; IEEE; IEEE Access; ISSN 2169-3536; 2025-01-01; vol. 13, pp. 54040-54068; DOI 10.1109/ACCESS.2025.3554184; IEEE document 10938132
	Title: EFFResNet-ViT: A Fusion-Based Convolutional and Vision Transformer Model for Explainable Medical Image Classification
	Authors:
	0. Tahir Hussain; https://orcid.org/0009-0005-7937-6485; Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan
	1. Hayaru Shouno; https://orcid.org/0000-0002-2412-0184; Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan
	2. Abid Hussain; School of Microelectronics, University of Science and Technology China, Hefei, Anhui, China
	3. Dostdar Hussain; https://orcid.org/0000-0002-8972-7622; Department of Computer Sciences, Karakoram International University, Gilgit, Pakistan
	4. Muhammad Ismail; https://orcid.org/0000-0001-7162-5700; Department of Computer Sciences, Karakoram International University, Gilgit, Pakistan
	5. Tatheer Hussain Mir; https://orcid.org/0009-0007-8409-8189; Intelligent System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
	6. Fang Rong Hsu; https://orcid.org/0000-0001-9791-317X; Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
	7. Taukir Alam; https://orcid.org/0000-0003-3353-5338; Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
	8. Shabnur Anonna Akhy; Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan
	URL: https://ieeexplore.ieee.org/document/10938132/
	Keywords: EfficientNet-B0; ResNet-50; vision transformer; model explainability; t-distributed stochastic neighbor embedding
title	EFFResNet-ViT: A Fusion-Based Convolutional and Vision Transformer Model for Explainable Medical Image Classification
topic	EfficientNet-B0; ResNet-50; vision transformer; model explainability; t-distributed stochastic neighbor embedding
url	https://ieeexplore.ieee.org/document/10938132/


Similar Items