Advancing the accuracy of clathrin protein prediction through multi-source protein language models

Abstract Clathrin is a key cytoplasmic protein that serves as the predominant structural element in the formation of coated vesicles. Specifically, clathrin enables the scission of newly formed vesicles from the plasma membrane’s cytoplasmic face. Efficient and accurate identification of clathrins...

Bibliographic Details
Main Authors: Watshara Shoombuatong, Nalini Schaduangrat, Pakpoom Mookdarsanit, Jaru Nikom, Lawankorn Mookdarsanit
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Scientific Reports
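Subjects: Clathrin; Sequence analysis; Bioinformatics; Protein language model; Machine learning; Feature selection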
Online Access: https://doi.org/10.1038/s41598-025-08510-4
author Watshara Shoombuatong
Nalini Schaduangrat
Pakpoom Mookdarsanit
Jaru Nikom
Lawankorn Mookdarsanit
collection DOAJ
description Abstract Clathrin is a key cytoplasmic protein that serves as the predominant structural element in the formation of coated vesicles. Specifically, clathrin enables the scission of newly formed vesicles from the plasma membrane’s cytoplasmic face. Efficient and accurate identification of clathrins is essential for understanding human diseases and aiding drug target development. Recent advancements in computational methods for identifying clathrins using sequence data have greatly improved large-scale clathrin screening. Here, we propose a high-accuracy computational approach, termed PLM-CLA, to achieve more accurate identification of clathrins. In PLM-CLA, we leveraged multi-source pre-trained protein language models (PLMs), which were trained on large-scale protein sequences from multiple database sources, including ProtT5-BFD, ProtT5-UR50, ProstT5, and ESM-2. These models were used to encode complementary feature embeddings, capturing diverse and valuable information. To the best of our knowledge, PLM-CLA is the first approach to identify clathrins using multiple PLM-based embeddings. To enhance prediction performance, we utilized a feature selection method to optimize these fused feature embeddings. Finally, we employed a long short-term memory (LSTM) neural network model coupled with the optimal feature subset to identify clathrins. Benchmarking experiments, including independent tests, showed that PLM-CLA significantly outperformed state-of-the-art methods, achieving an accuracy of 0.961, MCC of 0.917, and AUC of 0.997. Furthermore, PLM-CLA secured outstanding performance in terms of MCC, with values of 0.971 and 0.904 on two existing independent test datasets. We anticipate that the proposed PLM-CLA model will serve as a promising tool for large-scale identification of clathrins in resource-limited settings.
format Article
id doaj-art-9983c4a2a84147afb8345e012971778e
institution DOAJ
issn 2045-2322
language English
publishDate 2025-07-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-9983c4a2a84147afb8345e012971778e
2025-08-20T03:05:23Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-07-01
volume 15, issue 1, pages 1-14
10.1038/s41598-025-08510-4
Advancing the accuracy of clathrin protein prediction through multi-source protein language models
Watshara Shoombuatong [0]
Nalini Schaduangrat [1]
Pakpoom Mookdarsanit [2]
Jaru Nikom [3]
Lawankorn Mookdarsanit [4]
[0] Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University
[1] Center for Research Innovation and Biomedical Informatics, Faculty of Medical Technology, Mahidol University
[2] Computer Science and Artificial Intelligence, Faculty of Science, Chandrakasem Rajabhat University
[3] Research Methodology and Data Analytics Program, Faculty of Science and Technology, Prince of Songkla University
[4] Business Information System, Faculty of Management Science, Chandrakasem Rajabhat University
Abstract Clathrin is a key cytoplasmic protein that serves as the predominant structural element in the formation of coated vesicles. Specifically, clathrin enables the scission of newly formed vesicles from the plasma membrane’s cytoplasmic face. Efficient and accurate identification of clathrins is essential for understanding human diseases and aiding drug target development. Recent advancements in computational methods for identifying clathrins using sequence data have greatly improved large-scale clathrin screening. Here, we propose a high-accuracy computational approach, termed PLM-CLA, to achieve more accurate identification of clathrins. In PLM-CLA, we leveraged multi-source pre-trained protein language models (PLMs), which were trained on large-scale protein sequences from multiple database sources, including ProtT5-BFD, ProtT5-UR50, ProstT5, and ESM-2. These models were used to encode complementary feature embeddings, capturing diverse and valuable information. To the best of our knowledge, PLM-CLA is the first approach to identify clathrins using multiple PLM-based embeddings. To enhance prediction performance, we utilized a feature selection method to optimize these fused feature embeddings. Finally, we employed a long short-term memory (LSTM) neural network model coupled with the optimal feature subset to identify clathrins. Benchmarking experiments, including independent tests, showed that PLM-CLA significantly outperformed state-of-the-art methods, achieving an accuracy of 0.961, MCC of 0.917, and AUC of 0.997. Furthermore, PLM-CLA secured outstanding performance in terms of MCC, with values of 0.971 and 0.904 on two existing independent test datasets. We anticipate that the proposed PLM-CLA model will serve as a promising tool for large-scale identification of clathrins in resource-limited settings.
https://doi.org/10.1038/s41598-025-08510-4
Clathrin
Sequence analysis
Bioinformatics
Protein language model
Machine learning
Feature selection
title Advancing the accuracy of clathrin protein prediction through multi-source protein language models
topic Clathrin
Sequence analysis
Bioinformatics
Protein language model
Machine learning
Feature selection
url https://doi.org/10.1038/s41598-025-08510-4
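
The abstract in this record describes a data flow: embeddings from several protein language models are fused, a feature selection step reduces the fused vector, and predictions are scored with accuracy and MCC. The following is a minimal Python sketch of that flow, not the authors' implementation. Fusion by concatenation and selection by variance ranking are assumptions made for illustration, since the record does not name the fusion or selection method. The LSTM classifier is omitted. All function names, the toy vectors, and the confusion-matrix counts are hypothetical.

```python
import math
from statistics import pvariance


def fuse_embeddings(per_model):
    """Concatenate each protein's vectors from several models (assumed fusion)."""
    n = len(per_model[0])
    if any(len(m) != n for m in per_model):
        raise ValueError("every model must embed the same number of proteins")
    return [[x for m in per_model for x in m[i]] for i in range(n)]


def select_top_k_by_variance(rows, k):
    """Keep the k highest-variance columns (stand-in for the unnamed selector)."""
    n_cols = len(rows[0])
    variances = [pvariance([r[j] for r in rows]) for j in range(n_cols)]
    ranked = sorted(range(n_cols), key=lambda j: variances[j], reverse=True)
    keep = sorted(ranked[:k])  # restore original column order
    return [[r[j] for j in keep] for r in rows], keep


def accuracy(tp, tn, fp, fn):
    """Fraction of correct predictions."""
    return (tp + tn) / (tp + tn + fp + fn)


def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient; 0.0 when the denominator vanishes."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return 0.0 if denom == 0 else (tp * tn - fp * fn) / denom


# Toy per-protein embeddings from two hypothetical models (three proteins).
model_a = [[0.1, 0.9], [0.1, 0.2], [0.1, 0.5]]
model_b = [[1.0], [3.0], [2.0]]

fused = fuse_embeddings([model_a, model_b])
selected, kept = select_top_k_by_variance(fused, k=2)
print("kept columns:", kept)
print("selected features:", selected)

# Hypothetical confusion-matrix counts, unrelated to the reported results.
print("accuracy:", accuracy(tp=45, tn=50, fp=3, fn=2))
print("MCC:", round(mcc(tp=45, tn=50, fp=3, fn=2), 3))
```

In the toy data the first column is constant across proteins, so the variance-based selector drops it. A real pipeline would pass the selected features to a trained classifier before computing the metrics.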