Image-Based Diagnostic Performance of LLMs vs CNNs for Oral Lichen Planus: Example-Guided and Differential Diagnosis

Introduction and aims: Oral lichen planus (OLP), a chronic inflammatory condition of the oral mucosa, shares overlapping characteristics with other oral lesions, which presents diagnostic challenges. Large language models (LLMs) with integrated computer-vision capabilities and convolutional neural networks (CNNs) constitute alternative diagnostic modalities. We evaluated the ability of seven LLMs, both proprietary and open-source, to detect OLP in intraoral images and to generate differential diagnoses. Methods: A dataset of 1,142 clinical photographs of histopathologically confirmed OLP, non-OLP lesions, and normal mucosa was used. The LLMs were tested under three experimental designs: zero-shot recognition, example-guided recognition, and differential diagnosis. Performance was measured using accuracy, precision, recall, F1-score, and discounted cumulative gain (DCG). In addition, the LLMs were compared with three previously published CNN-based OLP-detection models on the subset of 110 photographs originally used to test those CNNs. Results: Gemini 1.5 Pro and Flash achieved the highest accuracy (69.69%) in zero-shot recognition, whereas GPT-4o ranked first in F1-score (76.10%). With example-guided prompts, which improved consistency and reduced refusal rates, Gemini 1.5 Flash achieved the highest accuracy (80.53%) and F1-score (84.54%), while Claude 3.5 Sonnet achieved the highest DCG (0.63). Although the proprietary models generally performed better, the open-source Llama model showed notable strength in ranking relevant diagnoses despite moderate detection performance. All LLMs were outperformed by the CNN models. Conclusion: None of the seven evaluated LLMs performs well enough for clinical use, and CNNs trained to detect OLP outperformed all LLMs tested in this study.

Bibliographic Details
Main Authors: Paak Rewthamrongsris, Jirayu Burapacheep, Ekarat Phattarataratip, Promphakkon Kulthanaamondhita, Antonin Tichy, Falk Schwendicke, Thanaphum Osathanon, Kraisorn Sappayatosok
Format: Article
Language:English
Published: Elsevier 2025-08-01
Series:International Dental Journal
Subjects: Chatbot; Computer-assisted diagnosis; Differential diagnosis; Generative artificial intelligence; Large language model; Oral lichen planus
Online Access:http://www.sciencedirect.com/science/article/pii/S0020653925001376
collection DOAJ
description Introduction and aims: Oral lichen planus (OLP), a chronic inflammatory condition of the oral mucosa, shares overlapping characteristics with other oral lesions, which presents diagnostic challenges. Large language models (LLMs) with integrated computer-vision capabilities and convolutional neural networks (CNNs) constitute alternative diagnostic modalities. We evaluated the ability of seven LLMs, both proprietary and open-source, to detect OLP in intraoral images and to generate differential diagnoses. Methods: A dataset of 1,142 clinical photographs of histopathologically confirmed OLP, non-OLP lesions, and normal mucosa was used. The LLMs were tested under three experimental designs: zero-shot recognition, example-guided recognition, and differential diagnosis. Performance was measured using accuracy, precision, recall, F1-score, and discounted cumulative gain (DCG). In addition, the LLMs were compared with three previously published CNN-based OLP-detection models on the subset of 110 photographs originally used to test those CNNs. Results: Gemini 1.5 Pro and Flash achieved the highest accuracy (69.69%) in zero-shot recognition, whereas GPT-4o ranked first in F1-score (76.10%). With example-guided prompts, which improved consistency and reduced refusal rates, Gemini 1.5 Flash achieved the highest accuracy (80.53%) and F1-score (84.54%), while Claude 3.5 Sonnet achieved the highest DCG (0.63). Although the proprietary models generally performed better, the open-source Llama model showed notable strength in ranking relevant diagnoses despite moderate detection performance. All LLMs were outperformed by the CNN models. Conclusion: None of the seven evaluated LLMs performs well enough for clinical use, and CNNs trained to detect OLP outperformed all LLMs tested in this study.
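The abstract evaluates classification with precision/recall/F1 and ranked differential diagnoses with discounted cumulative gain. As a minimal sketch of these metrics, assuming the standard DCG formulation with a log2 rank discount (the paper's exact relevance grading is not specified here), the example ranking below is hypothetical:

```python
import math

def precision_recall_f1(tp, fp, fn):
    # Standard binary classification metrics from confusion-matrix counts.
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

def dcg(relevances):
    # Discounted cumulative gain for a ranked list of relevance scores:
    # DCG = sum over 1-based ranks i of rel_i / log2(i + 1).
    return sum(rel / math.log2(i + 1) for i, rel in enumerate(relevances, start=1))

# Hypothetical differential-diagnosis list: relevance 1 marks the
# correct diagnosis (OLP), 0 marks incorrect suggestions.
ranked = [0, 1, 0]   # model placed the correct diagnosis second
print(dcg(ranked))   # 1 / log2(3), about 0.63
```

Under this formulation a correct diagnosis ranked first scores 1.0, and each lower rank is discounted logarithmically, so DCG rewards models that place the right diagnosis near the top of the differential list.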
id doaj-art-5a28e029509240d4b16e4cf8ca345d21
issn 0020-6539
spelling doaj-art-5a28e029509240d4b16e4cf8ca345d21
doi 10.1016/j.identj.2025.100848
Published in International Dental Journal, vol. 75, no. 4, article 100848, 2025-08-01.
Author affiliations:
Paak Rewthamrongsris: Center of Artificial Intelligence and Innovation (CAII) and Center of Excellence for Dental Stem Cell Biology, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand; Department of Conservative Dentistry and Periodontology, LMU University Hospital, LMU Munich, Germany
Jirayu Burapacheep: Department of Computer Science, Stanford University, Stanford, California, USA
Ekarat Phattarataratip: Department of Oral Pathology, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand
Promphakkon Kulthanaamondhita: College of Dental Medicine, Rangsit University, Pathum Thani, Thailand
Antonin Tichy: Department of Conservative Dentistry and Periodontology, LMU University Hospital, LMU Munich, Germany; Institute of Dental Medicine, First Faculty of Medicine, Charles University, Prague, Czech Republic
Falk Schwendicke: Department of Conservative Dentistry and Periodontology, LMU University Hospital, LMU Munich, Germany
Thanaphum Osathanon: Center of Artificial Intelligence and Innovation (CAII) and Center of Excellence for Dental Stem Cell Biology, Faculty of Dentistry, Chulalongkorn University, Bangkok, Thailand
Kraisorn Sappayatosok (corresponding author): College of Dental Medicine, Rangsit University, Pathum Thani, 12000 Thailand
topic Chatbot
Computer-assisted diagnosis
Differential diagnosis
Generative artificial intelligence
Large language model
Oral lichen planus