EFFResNet-ViT: A Fusion-Based Convolutional and Vision Transformer Model for Explainable Medical Image Classification

Bibliographic Details
Main Authors: Tahir Hussain, Hayaru Shouno, Abid Hussain, Dostdar Hussain, Muhammad Ismail, Tatheer Hussain Mir, Fang Rong Hsu, Taukir Alam, Shabnur Anonna Akhy
Format: Article
Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Online Access: https://ieeexplore.ieee.org/document/10938132/
author Tahir Hussain
Hayaru Shouno
Abid Hussain
Dostdar Hussain
Muhammad Ismail
Tatheer Hussain Mir
Fang Rong Hsu
Taukir Alam
Shabnur Anonna Akhy
collection DOAJ
description The rapid advancement of medical imaging technologies requires the development of advanced, automated, and interpretable diagnostic tools for clinical decision-making. Although convolutional neural networks (CNNs) have shown significant promise in medical image analysis, they have limitations in capturing the global context and lack interpretability, thereby hindering their clinical adoption. This study presents EFFResNet-ViT, a novel hybrid deep learning (DL) model designed to address these challenges by combining EfficientNet-B0 and ResNet-50 CNN backbones with a vision transformer (ViT) module. The proposed architecture employs a feature fusion strategy to integrate the local feature extraction strengths of CNNs with the global dependency modeling capabilities of transformers. The extracted features are further refined through a post-transformer CNN and a global average pooling layer to enhance the classification performance. To improve interpretability, EFFResNet-ViT incorporates Grad-CAM visualization techniques to highlight regions contributing to classification decisions and employs t-distributed stochastic neighbor embedding for feature space analysis, providing insights into class separability. The proposed model was evaluated on two benchmark datasets: brain tumor (BT) CE-MRI for BT classification and a retinal image dataset for ophthalmological diagnosis. EFFResNet-ViT achieved state-of-the-art performance, with accuracies of 99.31% and 92.54% on the BT CE-MRI and retinal datasets, respectively. Comparative analyses demonstrate the superior classification performance and interpretability of EFFResNet-ViT over existing ViT and CNN-based hybrid models. The explainable design of EFFResNet-ViT addresses the critical need for transparency in artificial intelligence-driven medical diagnostics, facilitating its potential integration into clinical workflows to improve decision-making and patient outcomes.
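The description names Grad-CAM as the visualization technique but gives no implementation detail. As a rough illustration only, the standard Grad-CAM computation can be sketched in plain Python: each channel weight is the global average of the class-score gradient over that channel, and the heatmap is the ReLU of the weighted sum of activation maps. The function names and toy values below are hypothetical and are not taken from the article; a real pipeline would obtain the activations and gradients from a chosen convolutional layer of the trained network.

```python
from typing import List

Map2D = List[List[float]]  # one channel: an H x W grid of values


def global_average_pool(channel: Map2D) -> float:
    """Return the mean over all spatial positions of one channel."""
    total = sum(sum(row) for row in channel)
    count = sum(len(row) for row in channel)
    return total / count


def grad_cam(activations: List[Map2D], gradients: List[Map2D]) -> Map2D:
    """Standard Grad-CAM: ReLU(sum_k alpha_k * A_k), alpha_k = GAP(dY/dA_k).

    activations[k] is the k-th feature map A_k of a convolutional layer;
    gradients[k] is the gradient of the class score Y with respect to A_k.
    """
    alphas = [global_average_pool(g) for g in gradients]
    height, width = len(activations[0]), len(activations[0][0])
    heatmap = [[0.0] * width for _ in range(height)]
    for alpha, channel in zip(alphas, activations):
        for i in range(height):
            for j in range(width):
                heatmap[i][j] += alpha * channel[i][j]
    # ReLU keeps only locations that push the class score upward.
    return [[max(0.0, value) for value in row] for row in heatmap]


# Toy input (hypothetical values): two channels of 2x2 feature maps.
acts = [[[1.0, 2.0], [3.0, 4.0]], [[4.0, 3.0], [2.0, 1.0]]]
grads = [[[1.0, 1.0], [1.0, 1.0]], [[-2.0, -2.0], [-2.0, -2.0]]]
cam = grad_cam(acts, grads)
print(cam)  # [[0.0, 0.0], [0.0, 2.0]]
```

In this toy case the channel weights are 1.0 and -2.0, so only the bottom-right position keeps a positive score after the ReLU. In practice the heatmap would then be upsampled to the input resolution and overlaid on the image.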
format Article
id doaj-art-7b66f8d79854442bbe7e55130d772b97
institution DOAJ
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling Record ID: doaj-art-7b66f8d79854442bbe7e55130d772b97
Record timestamp: 2025-08-20T03:16:42Z
Language: eng
Publisher: IEEE
Journal: IEEE Access
ISSN: 2169-3536
Publication date: 2025-01-01
Volume: 13
Pages: 54040-54068
DOI: 10.1109/ACCESS.2025.3554184
IEEE document number: 10938132
Title: EFFResNet-ViT: A Fusion-Based Convolutional and Vision Transformer Model for Explainable Medical Image Classification
Authors and affiliations:
Tahir Hussain (https://orcid.org/0009-0005-7937-6485), Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan
Hayaru Shouno (https://orcid.org/0000-0002-2412-0184), Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan
Abid Hussain, School of Microelectronics, University of Science and Technology China, Hefei, Anhui, China
Dostdar Hussain (https://orcid.org/0000-0002-8972-7622), Department of Computer Sciences, Karakoram International University, Gilgit, Pakistan
Muhammad Ismail (https://orcid.org/0000-0001-7162-5700), Department of Computer Sciences, Karakoram International University, Gilgit, Pakistan
Tatheer Hussain Mir (https://orcid.org/0009-0007-8409-8189), Intelligent System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan
Fang Rong Hsu (https://orcid.org/0000-0001-9791-317X), Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
Taukir Alam (https://orcid.org/0000-0003-3353-5338), Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan
Shabnur Anonna Akhy, Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan
Online access: https://ieeexplore.ieee.org/document/10938132/
Keywords: EfficientNet-B0; ResNet-50; vision transformer; model explainability; t-distributed stochastic neighbor embedding
title EFFResNet-ViT: A Fusion-Based Convolutional and Vision Transformer Model for Explainable Medical Image Classification
topic EfficientNet-B0
ResNet-50
vision transformer
model explainability
t-distributed stochastic neighbor embedding
url https://ieeexplore.ieee.org/document/10938132/