Contrastive-learning of language embedding and biological features for cross modality encoding and effector prediction

Abstract Identifying and characterizing virulence proteins secreted by Gram-negative bacteria are fundamental for deciphering microbial pathogenicity as well as aiding the development of therapeutic strategies. Effector predictors utilizing pre-trained protein language models (PLMs) have shown sound...

Bibliographic Details
Main Authors: Yue Peng, Junze Wu, Yi Sun, Yuanxing Zhang, Qiyao Wang, Shuai Shao
Format: Article
Language: English
Published: Nature Portfolio 2025-02-01
Series: Nature Communications
Online Access: https://doi.org/10.1038/s41467-025-56526-1
collection DOAJ
description Abstract Identifying and characterizing virulence proteins secreted by Gram-negative bacteria are fundamental for deciphering microbial pathogenicity and for aiding the development of therapeutic strategies. Effector predictors that use pre-trained protein language models (PLMs) have shown sound performance by leveraging extensive evolutionary and sequential protein features. However, achieving high accuracy and sensitivity in effector prediction remains challenging. Here, we introduce a model named Contrastive-learning of Language Embedding and Biological Features (CLEF), which leverages contrastive learning to integrate PLM representations with supplementary biological features. Biological information is captured in the learned contextualized embeddings to yield meaningful representations. With cross-modality biological features, CLEF outperforms state-of-the-art (SOTA) models in predicting type III, type IV, and type VI secreted effectors (T3SEs/T4SEs/T6SEs) in enteric pathogens. All experimentally verified effectors in Enterohemorrhagic Escherichia coli and 41 of 43 experimentally verified T3SEs of Salmonella Typhimurium are recognized. Moreover, 12 predicted T3SEs and 11 predicted T6SEs are validated by extensive experiments in Edwardsiella piscicida. Furthermore, integrating omics data via the CLEF framework enhances protein representations to illustrate effector-effector interactions and determine in vivo colonization-essential genes. Collectively, CLEF provides a blueprint to bridge the gap between the in silico capacity of PLMs and experimental biological information to fulfill complicated tasks.
id doaj-art-e412725c350746998965d9a9021b98b7
institution Kabale University
issn 2041-1723
spelling doaj-art-e412725c350746998965d9a9021b98b7 (record updated 2025-02-09T12:46:17Z; language: eng)
Nature Portfolio, Nature Communications, ISSN 2041-1723, published 2025-02-01, volume 16, issue 1, pages 1-20
https://doi.org/10.1038/s41467-025-56526-1
Contrastive-learning of language embedding and biological features for cross modality encoding and effector prediction
Yue Peng: State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology
Junze Wu: State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology
Yi Sun: State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology
Yuanxing Zhang: Southern Marine Science and Engineering Guangdong Laboratory (Zhuhai)
Qiyao Wang: State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology
Shuai Shao: State Key Laboratory of Bioreactor Engineering, East China University of Science and Technology
title Contrastive-learning of language embedding and biological features for cross modality encoding and effector prediction
url https://doi.org/10.1038/s41467-025-56526-1