Exploring the Impact of Image-Based Audio Representations in Classification Tasks Using Vision Transformers and Explainable AI Techniques

Bibliographic Details
Main Authors: Sari Masri, Ahmad Hasasneh, Mohammad Tami, Chakib Tadj
Format: Article
Language: English
Published: MDPI AG 2024-11-01
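Series: Information
Subjects: vision transformers (ViTs); infant cry classification; audio signals; image-based representations; gammatone frequency cepstral coefficients (GFCCs); spectrogram
Online Access: https://www.mdpi.com/2078-2489/15/12/751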
_version_ 1850059070362353664
author Sari Masri
Ahmad Hasasneh
Mohammad Tami
Chakib Tadj
collection DOAJ
description An important hurdle in medical diagnostics is the high-quality and interpretable classification of audio signals. In this study, we present an image-based representation of infant crying audio files to predict abnormal infant cries using a vision transformer and also show significant improvements in the performance and interpretability of this computer-aided tool. The use of advanced feature extraction techniques such as Gammatone Frequency Cepstral Coefficients (GFCCs) resulted in a classification accuracy of 96.33%. For other features (spectrogram and mel-spectrogram), the performance was very similar, with an accuracy of 93.17% for the spectrogram and 94.83% accuracy for the mel-spectrogram. We used our vision transformer (ViT) model, which is less complex but more effective than the proposed audio spectrogram transformer (AST). We incorporated explainable AI (XAI) techniques such as Layer-wise Relevance Propagation (LRP), Local Interpretable Model-agnostic Explanations (LIME), and attention mechanisms to ensure transparency and reliability in decision-making, which helped us understand the why of model predictions. The accuracy of detection was higher than previously reported and the results were easy to interpret, demonstrating that this work can potentially serve as a new benchmark for audio classification tasks, especially in medical diagnostics, and providing better prospects for an imminent future of trustworthy AI-based healthcare solutions.
format Article
id doaj-art-e1b96dd50e6d466090b316b7bdd28abe
institution DOAJ
issn 2078-2489
language English
publishDate 2024-11-01
publisher MDPI AG
record_format Article
series Information
spelling doaj-art-e1b96dd50e6d466090b316b7bdd28abe
Record timestamp: 2025-08-20T02:50:59Z
Language: eng
Publisher: MDPI AG
Journal: Information
ISSN: 2078-2489
Published: 2024-11-01
Volume 15, Issue 12, Article 751
DOI: 10.3390/info15120751
Title: Exploring the Impact of Image-Based Audio Representations in Classification Tasks Using Vision Transformers and Explainable AI Techniques
Sari Masri: Department of Natural, Engineering and Technology Sciences, Faculty of Graduate Studies, Arab American University, Ramallah P.O. Box 240, Palestine
Ahmad Hasasneh: Department of Natural, Engineering and Technology Sciences, Faculty of Graduate Studies, Arab American University, Ramallah P.O. Box 240, Palestine
Mohammad Tami: Department of Natural, Engineering and Technology Sciences, Faculty of Graduate Studies, Arab American University, Ramallah P.O. Box 240, Palestine
Chakib Tadj: Department of Electrical Engineering, École de Technologie Supérieur, Université du Québec, Montreal, QC H3C 1K3, Canada
title Exploring the Impact of Image-Based Audio Representations in Classification Tasks Using Vision Transformers and Explainable AI Techniques
topic vision transformers (ViTs)
infant cry classification
audio signals
image-based representations
gammatone frequency cepstral coefficients (GFCCs)
spectrogram
url https://www.mdpi.com/2078-2489/15/12/751