Effective Gene Expression Prediction and Optimization from Protein Sequences

Bibliographic Details
Main Authors: Tuoyu Liu, Yiyang Zhang, Yanjun Li, Guoshun Xu, Han Gao, Pengtao Wang, Tao Tu, Huiying Luo, Ningfeng Wu, Bin Yao, Bo Liu, Feifei Guan, Huoqing Huang, Jian Tian
Format: Article
Language: English
Published: Wiley 2025-02-01
Series: Advanced Science
Online Access: https://doi.org/10.1002/advs.202407664
collection DOAJ
description Abstract High soluble protein expression in heterologous hosts is crucial for various research and applications. Despite considerable research on the impact of codon usage on expression levels, the relationship between protein sequence and expression is often overlooked. In this study, a novel connection between protein expression and sequence is uncovered, leading to the development of SRAB (Strength of Relative Amino Acid Bias) based on AEI (Amino Acid Expression Index). The AEI served as an objective measure of this correlation, with higher AEI values enhancing soluble expression. Subsequently, the pre‐trained protein model MP‐TRANS (MindSpore Protein Transformer) is developed and fine‐tuned using transfer learning techniques to create 88 prediction models (MPB‐EXP) for predicting heterologous expression levels across 88 species. This approach achieved an average accuracy of 0.78, surpassing conventional machine learning methods. Additionally, a mutant generation model, MPB‐MUT, is devised and utilized to enhance expression levels in specific hosts. Experimental validation demonstrated that the top 3 mutants of xylanase (previously not expressed in Escherichia coli) successfully achieved high‐level soluble expression in E. coli. These findings highlight the efficacy of the developed model in predicting and optimizing gene expression based on protein sequences.
issn 2198-3844
volume 12
issue 8
affiliation State Key Laboratory of Animal Nutrition and Feeding, Institute of Animal Sciences, Chinese Academy of Agricultural Sciences, Beijing 100193, China (Tuoyu Liu, Guoshun Xu, Han Gao, Tao Tu, Huiying Luo, Bin Yao, Huoqing Huang, Jian Tian)
National Key Laboratory of Agricultural Microbiology, Biotechnology Research Institute, Chinese Academy of Agricultural Sciences, Beijing 100081, China (Yiyang Zhang, Yanjun Li, Pengtao Wang, Ningfeng Wu, Bo Liu, Feifei Guan)
topic amino acid expression index
mutant generation
predicting protein expression
soluble expression
transfer learning