Classifying Dry Eye Disease Patients from Healthy Controls Using Machine Learning and Metabolomics Data

Bibliographic Details
Main Authors: Sajad Amouei Sheshkal, Morten Gundersen, Michael Alexander Riegler, Øygunn Aass Utheim, Kjell Gunnar Gundersen, Helge Rootwelt, Katja Benedikte Prestø Elgstøen, Hugo Lewi Hammer
Format: Article
Language: English
Published: MDPI AG 2024-11-01
Series: Diagnostics
Online Access: https://www.mdpi.com/2075-4418/14/23/2696
Description
Summary:
Background: Dry eye disease is a common disorder of the ocular surface that leads patients to seek eye care. It is currently diagnosed from clinical signs and symptoms. Metabolomics, a method for analyzing biological systems, has been found helpful in identifying distinct metabolites in patients and in detecting metabolic profiles that may indicate dry eye disease at early stages. In this study, we explored the use of machine learning and metabolomics data to identify cataract patients who suffer from dry eye disease, a topic that, to our knowledge, has not been previously explored. Because no single machine learning model suits all metabolomics data, the choice of model can significantly affect the quality of predictions and subsequent metabolomics analyses.
Methods: To address this challenge, we conducted a comparative analysis of eight machine learning models on two metabolomics data sets from cataract patients with and without dry eye disease. The models were evaluated and optimized using nested k-fold cross-validation. To assess their performance, we selected a set of evaluation metrics suited to the challenges of the data sets.
Results: Overall, the logistic regression model performed best, achieving the highest area under the curve score of 0.8378, a balanced accuracy of 0.735, a Matthews correlation coefficient of 0.5147, an F1-score of 0.8513, and a specificity of 0.5667. The XGBoost and Random Forest models followed and also demonstrated good performance.
Conclusions: The results show that the logistic regression model with L2 regularization can outperform more complex models on an imbalanced data set with a small sample size and a high number of features, while also avoiding overfitting and delivering consistent performance across cross-validation folds. The results also demonstrate that it is possible to identify dry eye in cataract patients from tear film metabolomics data using machine learning models.
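The threshold-based metrics named in the summary (balanced accuracy, Matthews correlation coefficient, F1-score, and specificity) can all be derived from a binary confusion matrix. The following is a minimal sketch using their standard definitions, not the authors' code: the function names and the toy labels are hypothetical, and the labels are not data from the study. Area under the curve is omitted because it requires predicted scores rather than hard class labels.

```python
from math import sqrt


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == positive:
            if truth == positive:
                tp += 1
            else:
                fp += 1
        else:
            if truth == positive:
                fn += 1
            else:
                tn += 1
    return tp, fp, tn, fn


def imbalance_aware_metrics(y_true, y_pred, positive=1):
    """Hypothetical helper: metrics that stay informative on imbalanced data."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred, positive)

    # Recall on each class; a ratio is set to 0.0 when its denominator is zero.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    # Balanced accuracy is the mean of the two per-class recalls.
    balanced_accuracy = (sensitivity + specificity) / 2

    # F1 is the harmonic mean of precision and recall for the positive class.
    f1 = 2 * tp / (2 * tp + fp + fn) if (2 * tp + fp + fn) else 0.0

    # Matthews correlation coefficient uses all four cells of the matrix.
    denom = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denom if denom else 0.0

    return {
        "balanced_accuracy": balanced_accuracy,
        "mcc": mcc,
        "f1": f1,
        "specificity": specificity,
    }


if __name__ == "__main__":
    # Toy, imbalanced labels (illustrative only): 8 positives, 4 negatives.
    y_true = [1] * 8 + [0] * 4
    y_pred = [1] * 7 + [0] + [1, 1, 0, 0]
    for name, value in imbalance_aware_metrics(y_true, y_pred).items():
        print(f"{name}: {value:.4f}")
```

On an imbalanced data set, a high F1-score can coexist with a modest specificity, as in the reported results (0.8513 versus 0.5667), which is one reason to report several of these metrics together rather than accuracy alone.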
ISSN: 2075-4418