iNClassSec-ESM: Discovering potential non-classical secreted proteins through a novel protein language model

Non-classical secreted proteins (NCSPs) are a class of proteins lacking signal peptides, secreted by Gram-positive bacteria through non-classical secretion pathways. With the increasing demand for highly secreted proteins in recent years, non-classical secretion pathways have received more attention...

Bibliographic Details
Main Authors: Yizhou Shao, Taigang Liu
Format: Article
Language: English
Published: Elsevier 2025-01-01
Series: Computational and Structural Biotechnology Journal
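Subjects: Non-classical secreted protein; Protein language model; Embeddings; Ensemble learning; Protein representation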
Online Access: http://www.sciencedirect.com/science/article/pii/S200103702500114X
_version_ 1849727635600441344
author Yizhou Shao
Taigang Liu
collection DOAJ
description Non-classical secreted proteins (NCSPs) are a class of proteins lacking signal peptides, secreted by Gram-positive bacteria through non-classical secretion pathways. With the increasing demand for highly secreted proteins in recent years, non-classical secretion pathways have received more attention due to their advantages over classical secretion pathways (Sec/Tat). However, because the mechanisms of non-classical secretion pathways are not yet clear, identifying NCSPs through biological experiments is expensive and time-consuming, making it imperative to develop computational methods to address this issue. Existing NCSP prediction methods mainly use traditional handcrafted features to represent proteins from sequence information, which limits the models' ability to capture complex protein characteristics. In this study, we propose a novel NCSP predictor, iNClassSec-ESM, which combines deep learning with traditional classifiers to enhance prediction performance. iNClassSec-ESM integrates an XGBoost model trained on comprehensive handcrafted features and a Deep Neural Network (DNN) trained on hidden layer embeddings from the protein language model (PLM) ESM3. ESM3 is a recently proposed multimodal PLM that has not yet been fully explored for protein representation. Therefore, we extracted hidden layer embeddings from ESM3 as inputs for multiple classifiers and deep learning networks, and compared them with embeddings from existing PLMs. Benchmark experiments indicate that iNClassSec-ESM outperforms most existing methods across multiple performance metrics and could serve as an effective tool for discovering potential NCSPs. Additionally, the ESM3 hidden layer embeddings, as an innovative protein representation method, show great potential for application to broader protein-related classification tasks.
The source code of iNClassSec-ESM and the ESM3 embeddings extraction script are publicly available at https://github.com/AmamiyaHoshie/iNClassSec-ESM/.
format Article
id doaj-art-cf125a1e6d934c1e9cabca09310b037e
institution DOAJ
issn 2001-0370
language English
publishDate 2025-01-01
publisher Elsevier
record_format Article
series Computational and Structural Biotechnology Journal
spelling doaj-art-cf125a1e6d934c1e9cabca09310b037e; 2025-08-20T03:09:47Z; eng; Elsevier; Computational and Structural Biotechnology Journal; ISSN 2001-0370; 2025-01-01; vol. 27, pp. 1350-1358; doi:10.1016/j.csbj.2025.03.043
Yizhou Shao (College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China)
Taigang Liu (Corresponding author; College of Information Technology, Shanghai Ocean University, Shanghai, 201306, China)
title iNClassSec-ESM: Discovering potential non-classical secreted proteins through a novel protein language model
topic Non-classical secreted protein
Protein language model
Embeddings
Ensemble learning
Protein representation
url http://www.sciencedirect.com/science/article/pii/S200103702500114X