EFFResNet-ViT: A Fusion-Based Convolutional and Vision Transformer Model for Explainable Medical Image Classification
The rapid advancement of medical imaging technologies requires the development of advanced, automated, and interpretable diagnostic tools for clinical decision-making. Although convolutional neural networks (CNNs) have shown significant promise in medical image analysis, they have limitations in cap...
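| Main Authors: | Tahir Hussain, Hayaru Shouno, Abid Hussain, Dostdar Hussain, Muhammad Ismail, Tatheer Hussain Mir, Fang Rong Hsu, Taukir Alam, Shabnur Anonna Akhy |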
|---|---|
| Format: | Article |
| Language: | English |
| Published: | IEEE, 2025-01-01 |
| Series: | IEEE Access |
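| Subjects: | EfficientNet-B0; ResNet-50; vision transformer; model explainability; t-distributed stochastic neighbor embedding |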
| Online Access: | https://ieeexplore.ieee.org/document/10938132/ |
| _version_ | 1849704640771260416 |
|---|---|
| author | Tahir Hussain; Hayaru Shouno; Abid Hussain; Dostdar Hussain; Muhammad Ismail; Tatheer Hussain Mir; Fang Rong Hsu; Taukir Alam; Shabnur Anonna Akhy |
| author_facet | Tahir Hussain; Hayaru Shouno; Abid Hussain; Dostdar Hussain; Muhammad Ismail; Tatheer Hussain Mir; Fang Rong Hsu; Taukir Alam; Shabnur Anonna Akhy |
| author_sort | Tahir Hussain |
| collection | DOAJ |
| description | The rapid advancement of medical imaging technologies requires the development of advanced, automated, and interpretable diagnostic tools for clinical decision-making. Although convolutional neural networks (CNNs) have shown significant promise in medical image analysis, they have limitations in capturing the global context and lack interpretability, thereby hindering their clinical adoption. This study presents EFFResNet-ViT, a novel hybrid deep learning (DL) model designed to address these challenges by combining EfficientNet-B0 and ResNet-50 CNN backbones with a vision transformer (ViT) module. The proposed architecture employs a feature fusion strategy to integrate the local feature extraction strengths of CNNs with the global dependency modeling capabilities of transformers. The extracted features are further refined through a post-transformer CNN and a global average pooling layer to enhance the classification performance. To improve interpretability, EFFResNet-ViT incorporates Grad-CAM visualization techniques to highlight regions contributing to classification decisions and employs t-distributed stochastic neighbor embedding for feature space analysis, providing insights into class separability. The proposed model was evaluated on two benchmark datasets: brain tumor (BT) CE-MRI for BT classification and a retinal image dataset for ophthalmological diagnosis. EFFResNet-ViT achieved state-of-the-art performance, with accuracies of 99.31% and 92.54% on the BT CE-MRI and retinal datasets, respectively. Comparative analyses demonstrate the superior classification performance and interpretability of EFFResNet-ViT over existing ViT and CNN-based hybrid models. The explainable design of EFFResNet-ViT addresses the critical need for transparency in artificial intelligence-driven medical diagnostics, facilitating its potential integration into clinical workflows to improve decision-making and patient outcomes. |
| format | Article |
| id | doaj-art-7b66f8d79854442bbe7e55130d772b97 |
| institution | DOAJ |
| issn | 2169-3536 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | IEEE |
| record_format | Article |
| series | IEEE Access |
| spelling | doaj-art-7b66f8d79854442bbe7e55130d772b97; 2025-08-20T03:16:42Z; eng; IEEE; IEEE Access; 2169-3536; 2025-01-01; 13; 54040; 54068; 10.1109/ACCESS.2025.3554184; 10938132; EFFResNet-ViT: A Fusion-Based Convolutional and Vision Transformer Model for Explainable Medical Image Classification; Authors: Tahir Hussain (0, https://orcid.org/0009-0005-7937-6485), Hayaru Shouno (1, https://orcid.org/0000-0002-2412-0184), Abid Hussain (2), Dostdar Hussain (3, https://orcid.org/0000-0002-8972-7622), Muhammad Ismail (4, https://orcid.org/0000-0001-7162-5700), Tatheer Hussain Mir (5, https://orcid.org/0009-0007-8409-8189), Fang Rong Hsu (6, https://orcid.org/0000-0001-9791-317X), Taukir Alam (7, https://orcid.org/0000-0003-3353-5338), Shabnur Anonna Akhy (8); Affiliations, in record order: Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan; Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan; School of Microelectronics, University of Science and Technology China, Hefei, Anhui, China; Department of Computer Sciences, Karakoram International University, Gilgit, Pakistan; Department of Computer Sciences, Karakoram International University, Gilgit, Pakistan; Intelligent System Laboratory, Department of Electrical Engineering, College of Electrical Engineering and Computer Science, National Kaohsiung University of Science and Technology, Kaohsiung, Taiwan; Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan; Department of Information Engineering and Computer Science, Feng Chia University, Taichung, Taiwan; Department of Informatics, Graduate School of Informatics and Engineering, The University of Electro-Communications, Tokyo, Japan; https://ieeexplore.ieee.org/document/10938132/; Keywords: EfficientNet-B0; ResNet-50; vision transformer; model explainability; t-distributed stochastic neighbor embedding |
| spellingShingle | Tahir Hussain; Hayaru Shouno; Abid Hussain; Dostdar Hussain; Muhammad Ismail; Tatheer Hussain Mir; Fang Rong Hsu; Taukir Alam; Shabnur Anonna Akhy; EFFResNet-ViT: A Fusion-Based Convolutional and Vision Transformer Model for Explainable Medical Image Classification; IEEE Access; EfficientNet-B0; ResNet-50; vision transformer; model explainability; t-distributed stochastic neighbor embedding |
| title | EFFResNet-ViT: A Fusion-Based Convolutional and Vision Transformer Model for Explainable Medical Image Classification |
| topic | EfficientNet-B0; ResNet-50; vision transformer; model explainability; t-distributed stochastic neighbor embedding |
| url | https://ieeexplore.ieee.org/document/10938132/ |
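
The description field in this record outlines a pipeline in which features from two CNN backbones are fused, passed through a transformer module, and reduced by global average pooling. The following is a minimal sketch of that data flow only, using toy nested lists in place of real EfficientNet-B0 and ResNet-50 feature maps. It is not the authors' implementation: channel concatenation as the fusion rule, identity query/key/value projections in the attention step, and all function names (`concat_channels`, `to_tokens`, `self_attention`, `global_average_pool`) are assumptions made for illustration, and the post-transformer CNN mentioned in the abstract is omitted.

```python
import math


def concat_channels(feat_a, feat_b):
    """Fuse two feature maps shaped [C][H][W] by channel concatenation (assumed fusion rule)."""
    if len(feat_a[0]) != len(feat_b[0]) or len(feat_a[0][0]) != len(feat_b[0][0]):
        raise ValueError("spatial sizes must match before fusion")
    return feat_a + feat_b


def to_tokens(feat):
    """Flatten a [C][H][W] map into H*W tokens, each of dimension C."""
    c, h, w = len(feat), len(feat[0]), len(feat[0][0])
    return [[feat[k][i][j] for k in range(c)] for i in range(h) for j in range(w)]


def self_attention(tokens):
    """Single-head scaled dot-product self-attention with identity projections (illustrative)."""
    d = len(tokens[0])
    out = []
    for q in tokens:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(d) for k in tokens]
        m = max(scores)  # subtract the max for a numerically stable softmax
        exps = [math.exp(s - m) for s in scores]
        z = sum(exps)
        weights = [e / z for e in exps]
        out.append([sum(wt * v[j] for wt, v in zip(weights, tokens)) for j in range(d)])
    return out


def global_average_pool(tokens):
    """Average all tokens into one feature vector of dimension C."""
    n = len(tokens)
    return [sum(t[j] for t in tokens) / n for j in range(len(tokens[0]))]


# Toy stand-ins for the two backbone outputs: 1 channel and 2 channels on a 2x2 grid.
feat_a = [[[1.0, 2.0], [3.0, 4.0]]]
feat_b = [[[0.0, 0.0], [0.0, 0.0]], [[1.0, 1.0], [1.0, 1.0]]]

fused = concat_channels(feat_a, feat_b)   # 3 channels
tokens = to_tokens(fused)                 # 4 tokens of dimension 3
attended = self_attention(tokens)         # global mixing across all positions
pooled = global_average_pool(attended)    # one 3-dimensional vector for a classifier head
```

Because every attention output is a weighted average of the input tokens, each pooled value stays within the range of its channel; a real model would add learned projections, multiple heads, and a classification layer on top of `pooled`.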