Image-Based Diagnostic Performance of LLMs vs CNNs for Oral Lichen Planus: Example-Guided and Differential Diagnosis

Introduction and aims: The overlapping characteristics of oral lichen planus (OLP), a chronic oral mucosal inflammatory condition, with those of other oral lesions, present diagnostic challenges. Large language models (LLMs) with integrated computer-vision capabilities and convolutional neural netwo...

Full description

Saved in:
Bibliographic Details
Main Authors: Paak Rewthamrongsris, Jirayu Burapacheep, Ekarat Phattarataratip, Promphakkon Kulthanaamondhita, Antonin Tichy, Falk Schwendicke, Thanaphum Osathanon, Kraisorn Sappayatosok
Format: Article
Language:English
Published: Elsevier 2025-08-01
Series:International Dental Journal
Subjects:
Online Access:http://www.sciencedirect.com/science/article/pii/S0020653925001376
Tags: Add Tag
No Tags, Be the first to tag this record!
Description
Summary:Introduction and aims: The overlapping characteristics of oral lichen planus (OLP), a chronic oral mucosal inflammatory condition, with those of other oral lesions, present diagnostic challenges. Large language models (LLMs) with integrated computer-vision capabilities and convolutional neural networks (CNNs) constitute an alternative diagnostic modality. We evaluated the ability of seven LLMs, including both proprietary and open-source models, to detect OLP from intraoral images and generate differential diagnoses. Methods: Using a dataset with 1,142 clinical photographs of histopathologically confirmed OLP, non-OLP lesions, and normal mucosa. The LLMs were tested using three experimental designs: zero-shot recognition, example-guided recognition, and differential diagnosis. Performance was measured using accuracy, precision, recall, F1-score, and discounted cumulative gain (DCG). Furthermore, the performance of LLMs was compared with three previously published CNN-based models for OLP detection on a subset of 110 photographs, which were previously used to test the CNN models. Results: Gemini 1.5 Pro and Flash demonstrated the highest accuracy (69.69%) in zero-shot recognition, whereas GPT-4o ranked first in the F1 score (76.10%). With example-guided prompts, which improved consistency and reduced refusal rates, Gemini 1.5 Flash achieved the highest accuracy (80.53%) and F1-score (84.54%); however, Claude 3.5 Sonnet achieved the highest DCG score of 0.63. Although the proprietary models generally excelled, the open-source Llama model demonstrated notable strengths in ranking relevant diagnoses despite moderate performance in detection tasks. All LLMs were outperformed by the CNN models. Conclusion: The seven evaluated LLMs lack sufficient performance for clinical use. CNNs trained to detect OLP outperformed the LLMs tested in this study.
ISSN:0020-6539