HPClas: A data‐driven approach for identifying halophilic proteins based on catBoost

Abstract Halophilic proteins possess unique structural properties and show high stability under extreme conditions. This distinct characteristic makes them invaluable for application in various aspects such as bioenergy, pharmaceuticals, environmental clean‐up, and energy production. Generally, halo...

Bibliographic Details
Main Authors: Shantong Hu, Xiaoyu Wang, Zhikang Wang, Menghan Jiang, Shihui Wang, Wenya Wang, Jiangning Song, Guimin Zhang
Format: Article
Language: English
Published: Wiley 2024-12-01
Series: mLife
Subjects:
Online Access: https://doi.org/10.1002/mlf2.12125
author Shantong Hu
Xiaoyu Wang
Zhikang Wang
Menghan Jiang
Shihui Wang
Wenya Wang
Jiangning Song
Guimin Zhang
author_sort Shantong Hu
collection DOAJ
description Abstract Halophilic proteins possess unique structural properties and show high stability under extreme conditions. This distinct characteristic makes them invaluable for application in various aspects such as bioenergy, pharmaceuticals, environmental clean‐up, and energy production. Generally, halophilic proteins are discovered and characterized through labor‐intensive and time‐consuming wet lab experiments. In this study, we introduce the Halophilic Protein Classifier (HPClas), a machine learning‐based classifier developed using the catBoost ensemble learning technique to identify halophilic proteins. Extensive in silico calculations were conducted on a large public dataset of 12,574 samples and HPClas achieved an area under the receiver operating characteristic curve (AUROC) of 0.844 on an independent test set of 200 samples. The source code and curated dataset of HPClas are publicly available at https://github.com/Showmake2/HPClas. In conclusion, HPClas can be explored as a promising tool to aid in the identification of halophilic proteins and accelerate their application in different fields.
format Article
id doaj-art-bc84685175984b05ab8ae83b3b209888
institution DOAJ
issn 2770-100X
language English
publishDate 2024-12-01
publisher Wiley
record_format Article
series mLife
spelling doaj-art-bc84685175984b05ab8ae83b3b209888 | 2025-08-20T02:39:51Z | eng | Wiley | mLife | 2770-100X | 2024-12-01 | 3 | 4 | 515 | 526 | 10.1002/mlf2.12125
HPClas: A data‐driven approach for identifying halophilic proteins based on catBoost
Shantong Hu (College of Life Science and Technology, Beijing University of Chemical Technology, Beijing, China)
Xiaoyu Wang (Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, Australia)
Zhikang Wang (Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, Australia)
Menghan Jiang (College of Life Science and Technology, Beijing University of Chemical Technology, Beijing, China)
Shihui Wang (College of Life Science and Technology, Beijing University of Chemical Technology, Beijing, China)
Wenya Wang (College of Life Science and Technology, Beijing University of Chemical Technology, Beijing, China)
Jiangning Song (Monash Biomedicine Discovery Institute and Department of Biochemistry and Molecular Biology, Monash University, Melbourne, Victoria, Australia)
Guimin Zhang (College of Life Science and Technology, Beijing University of Chemical Technology, Beijing, China)
https://doi.org/10.1002/mlf2.12125
feature engineering
halophilic protein
machine learning
title HPClas: A data‐driven approach for identifying halophilic proteins based on catBoost
topic feature engineering
halophilic protein
machine learning
url https://doi.org/10.1002/mlf2.12125