Enhanced Cleft Lip and Palate Classification Using SigLIP 2: A Comparative Study with Vision Transformers and Siamese Networks

Bibliographic Details
Main Authors: Oraphan Nantha, Benjaporn Sathanarugsawait, Prasong Praneetpolgrang
Format: Article
Language: English
Published: MDPI AG 2025-04-01
Series: Applied Sciences
Subjects: cleft lip and palate; vision–language models; few-shot learning; medical image analysis; AI in healthcare
Online Access:https://www.mdpi.com/2076-3417/15/9/4766
collection DOAJ
description This paper extends our previous work on cleft lip and/or palate (CL/P) classification, which employed vision transformers (ViTs) and Siamese neural networks. We now integrate SigLIP 2, a state-of-the-art multilingual vision–language model, for feature extraction, replacing the previously utilized BiomedCLIP. SigLIP 2 offers enhanced semantic understanding, improved localization capabilities, and multilingual support, potentially leading to more robust feature representations for CL/P classification. We hypothesize that SigLIP 2’s superior feature extraction will improve the classification accuracy of CL/P types (bilateral, unilateral, and palate-only) from the UltraSuite CLEFT dataset, a collection of ultrasound video sequences capturing tongue movements during speech with synchronized audio recordings. A comparative analysis is conducted, evaluating the performance of our original ViT-Siamese network model (using BiomedCLIP) against a new model leveraging SigLIP 2 for feature extraction. Performance is assessed using accuracy, precision, recall, and F1 score, demonstrating the impact of SigLIP 2 on CL/P classification. The new model achieves statistically significant improvements in overall accuracy (86.6% vs. 82.76%) and F1 scores for all cleft types. We discuss the computational efficiency and practical implications of employing SigLIP 2 in a clinical setting, highlighting its potential for earlier and more accurate diagnosis, personalized treatment planning, and broader applicability across diverse populations. The results demonstrate the significant potential of advanced vision–language models, such as SigLIP 2, to enhance AI-powered medical diagnostics.
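The description above reports accuracy, precision, recall, and F1 score per cleft type. As a minimal illustration only (not the authors' evaluation code, and with made-up example labels), these per-class metrics can be computed as follows:

```python
def classification_metrics(y_true, y_pred, labels):
    """Overall accuracy plus per-class precision, recall, and F1."""
    per_class = {}
    for c in labels:
        # True positives, false positives, false negatives for class c.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if tp + fp else 0.0
        recall = tp / (tp + fn) if tp + fn else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if precision + recall else 0.0)
        per_class[c] = {"precision": precision, "recall": recall, "f1": f1}
    accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
    return accuracy, per_class

# Hypothetical predictions for the three cleft types named in the abstract.
y_true = ["bilateral", "unilateral", "palate-only", "unilateral", "bilateral"]
y_pred = ["bilateral", "unilateral", "palate-only", "bilateral", "bilateral"]
acc, per_class = classification_metrics(
    y_true, y_pred, ["bilateral", "unilateral", "palate-only"])
```

In practice a library routine such as scikit-learn's `precision_recall_fscore_support` would be used; the hand-rolled version above only shows what the reported numbers measure.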
format Article
id doaj-art-2d2ca7cefeda4427a31babb8a04b200d
institution OA Journals
issn 2076-3417
language English
publishDate 2025-04-01
publisher MDPI AG
record_format Article
series Applied Sciences
spelling doaj-art-2d2ca7cefeda4427a31babb8a04b200d; 2025-08-20T01:49:27Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2025-04-01; vol. 15, iss. 9, art. 4766; doi:10.3390/app15094766; Enhanced Cleft Lip and Palate Classification Using SigLIP 2: A Comparative Study with Vision Transformers and Siamese Networks; Oraphan Nantha, Benjaporn Sathanarugsawait, Prasong Praneetpolgrang (School of Information Technology, Sripatum University, Bangkok 10900, Thailand); https://www.mdpi.com/2076-3417/15/9/4766; cleft lip and palate; vision–language models; few-shot learning; medical image analysis; AI in healthcare
title Enhanced Cleft Lip and Palate Classification Using SigLIP 2: A Comparative Study with Vision Transformers and Siamese Networks
topic cleft lip and palate
vision–language models
few-shot learning
medical image analysis
AI in healthcare
url https://www.mdpi.com/2076-3417/15/9/4766