Enhanced Cleft Lip and Palate Classification Using SigLIP 2: A Comparative Study with Vision Transformers and Siamese Networks
This paper extends our previous work on cleft lip and/or palate (CL/P) classification, which employed vision transformers (ViTs) and Siamese neural networks. We now integrate SigLIP 2, a state-of-the-art multilingual vision–language model, for feature extraction, replacing the previously utilized BiomedCLIP. SigLIP 2 offers enhanced semantic understanding, improved localization capabilities, and multilingual support, potentially leading to more robust feature representations for CL/P classification. We hypothesize that SigLIP 2’s superior feature extraction will improve the classification accuracy of CL/P types (bilateral, unilateral, and palate-only) from the UltraSuite CLEFT dataset, a collection of ultrasound video sequences capturing tongue movements during speech with synchronized audio recordings. A comparative analysis is conducted, evaluating the performance of our original ViT-Siamese network model (using BiomedCLIP) against a new model leveraging SigLIP 2 for feature extraction. Performance is assessed using accuracy, precision, recall, and F1 score, demonstrating the impact of SigLIP 2 on CL/P classification. The new model achieves statistically significant improvements in overall accuracy (86.6% vs. 82.76%) and F1 scores for all cleft types. We discuss the computational efficiency and practical implications of employing SigLIP 2 in a clinical setting, highlighting its potential for earlier and more accurate diagnosis, personalized treatment planning, and broader applicability across diverse populations. The results demonstrate the significant potential of advanced vision–language models, such as SigLIP 2, to enhance AI-powered medical diagnostics.
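The abstract reports overall accuracy and per-class F1 gains for the three cleft types. As a minimal illustration of how such per-class metrics are derived (using made-up toy labels, not the study's data), precision, recall, and F1 for each class can be computed from true/predicted label pairs:

```python
# Sketch: per-class precision/recall/F1 over the three CL/P types.
# The label sequences below are illustrative toys, not the paper's results.

def per_class_scores(y_true, y_pred, classes):
    """Return precision, recall, and F1 for each class (one-vs-rest counts)."""
    scores = {}
    for c in classes:
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        fp = sum(1 for t, p in zip(y_true, y_pred) if t != c and p == c)
        fn = sum(1 for t, p in zip(y_true, y_pred) if t == c and p != c)
        precision = tp / (tp + fp) if (tp + fp) else 0.0
        recall = tp / (tp + fn) if (tp + fn) else 0.0
        f1 = (2 * precision * recall / (precision + recall)
              if (precision + recall) else 0.0)
        scores[c] = {"precision": precision, "recall": recall, "f1": f1}
    return scores

classes = ["bilateral", "unilateral", "palate-only"]
y_true = ["bilateral", "unilateral", "palate-only", "unilateral", "bilateral"]
y_pred = ["bilateral", "unilateral", "unilateral", "unilateral", "palate-only"]

scores = per_class_scores(y_true, y_pred, classes)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
```

In practice a library routine (e.g. scikit-learn's `precision_recall_fscore_support`) would replace the hand-rolled counts; the sketch only makes explicit what "F1 scores for all cleft types" measures.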
| Main Authors: | Oraphan Nantha, Benjaporn Sathanarugsawait, Prasong Praneetpolgrang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-04-01 |
| Series: | Applied Sciences |
| Subjects: | cleft lip and palate; vision–language models; few-shot learning; medical image analysis; AI in healthcare |
| Online Access: | https://www.mdpi.com/2076-3417/15/9/4766 |
| ISSN: | 2076-3417 |
| DOI: | 10.3390/app15094766 |
| Author Affiliations: | School of Information Technology, Sripatum University, Bangkok 10900, Thailand (all authors) |