Exploring the Impact of Image-Based Audio Representations in Classification Tasks Using Vision Transformers and Explainable AI Techniques
An important hurdle in medical diagnostics is the high-quality and interpretable classification of audio signals. In this study, we present an image-based representation of infant crying audio files to predict abnormal infant cries using a vision transformer and also show significant improvements in...
| Main Authors: | Sari Masri, Ahmad Hasasneh, Mohammad Tami, Chakib Tadj |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2024-11-01 |
| Series: | Information |
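| Subjects: | vision transformers (ViTs); infant cry classification; audio signals; image-based representations; gammatone frequency cepstral coefficients (GFCCs); spectrogram |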
| Online Access: | https://www.mdpi.com/2078-2489/15/12/751 |
| Field | Value |
|---|---|
| author | Sari Masri; Ahmad Hasasneh; Mohammad Tami; Chakib Tadj |
| collection | DOAJ |
| description | An important hurdle in medical diagnostics is the high-quality and interpretable classification of audio signals. In this study, we present an image-based representation of infant crying audio files to predict abnormal infant cries using a vision transformer and also show significant improvements in the performance and interpretability of this computer-aided tool. The use of advanced feature extraction techniques such as Gammatone Frequency Cepstral Coefficients (GFCCs) resulted in a classification accuracy of 96.33%. The other features came close, with accuracies of 93.17% for the spectrogram and 94.83% for the mel-spectrogram. We used our vision transformer (ViT) model, which is less complex but more effective than the proposed audio spectrogram transformer (AST). We incorporated explainable AI (XAI) techniques such as Layer-wise Relevance Propagation (LRP), Local Interpretable Model-agnostic Explanations (LIME), and attention mechanisms to ensure transparency and reliability in decision-making, which helped us understand why the model made its predictions. The accuracy of detection was higher than previously reported and the results were easy to interpret, demonstrating that this work can potentially serve as a new benchmark for audio classification tasks, especially in medical diagnostics, and providing better prospects for an imminent future of trustworthy AI-based healthcare solutions. |
| format | Article |
| id | doaj-art-e1b96dd50e6d466090b316b7bdd28abe |
| institution | DOAJ |
| issn | 2078-2489 |
| language | English |
| publishDate | 2024-11-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Information |
| spelling | Record ID: doaj-art-e1b96dd50e6d466090b316b7bdd28abe; timestamp: 2025-08-20T02:50:59Z; language: eng; publisher: MDPI AG; journal: Information; ISSN: 2078-2489; published: 2024-11-01; volume 15, issue 12, article 751; DOI: 10.3390/info15120751; authors and affiliations: Sari Masri, Ahmad Hasasneh, Mohammad Tami (Department of Natural, Engineering and Technology Sciences, Faculty of Graduate Studies, Arab American University, Ramallah P.O. Box 240, Palestine); Chakib Tadj (Department of Electrical Engineering, École de Technologie Supérieur, Université du Québec, Montreal, QC H3C 1K3, Canada); URL: https://www.mdpi.com/2078-2489/15/12/751; keywords: vision transformers (ViTs); infant cry classification; audio signals; image-based representations; gammatone frequency cepstral coefficients (GFCCs); spectrogram |
| title | Exploring the Impact of Image-Based Audio Representations in Classification Tasks Using Vision Transformers and Explainable AI Techniques |
| topic | vision transformers (ViTs); infant cry classification; audio signals; image-based representations; gammatone frequency cepstral coefficients (GFCCs); spectrogram |
| url | https://www.mdpi.com/2078-2489/15/12/751 |