Designing diverse and high-performance proteins with a large language model in the loop.


Bibliographic Details
Main Authors: Carlos A Gomez-Uribe, Japheth Gado, Meiirbek Islamov
Format: Article
Language: English
Published: Public Library of Science (PLoS) 2025-06-01
Series: PLoS Computational Biology
Online Access: https://doi.org/10.1371/journal.pcbi.1013119
author Carlos A Gomez-Uribe
Japheth Gado
Meiirbek Islamov
author_facet Carlos A Gomez-Uribe
Japheth Gado
Meiirbek Islamov
author_sort Carlos A Gomez-Uribe
collection DOAJ
description We present a protein engineering approach to directed evolution with machine learning that integrates a new semi-supervised neural network fitness prediction model, Seq2Fitness, and an innovative optimization algorithm, biphasic annealing for diverse and adaptive sequence sampling (BADASS), to design sequences. Seq2Fitness leverages protein language models to predict fitness landscapes, combining evolutionary data with experimental labels, while BADASS efficiently explores these landscapes by dynamically adjusting temperature and mutation energies to prevent premature convergence and to generate diverse high-fitness sequences. Compared to alternative models, Seq2Fitness improves Spearman correlation with experimental fitness measurements, increasing from 0.34 to 0.55 on sequences containing mutations at positions entirely unseen during training. BADASS requires less memory and computation than gradient-based Markov Chain Monte Carlo methods, while generating more high-fitness and diverse sequences across two protein families. For both families, 100% of the top 10,000 sequences identified by BADASS exceed the wildtype in predicted fitness, whereas for competing methods this fraction ranges from 3% to 99%, and they often produce far fewer than 10,000 sequences. BADASS also finds higher-fitness sequences at every cutoff (top 1, 100, and 10,000). Additionally, we provide a theoretical framework explaining BADASS's underlying mechanism and behavior. While we focus on amino acid sequences, BADASS may generalize to other sequence spaces, such as DNA and RNA.
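The abstract describes BADASS as annealing-style sampling over mutation energies with a temperature schedule that cools and then reheats to avoid premature convergence. A minimal illustrative sketch of that general idea follows; the temperature schedule, the toy fitness function, and all names here are hypothetical stand-ins (the real method uses a learned Seq2Fitness model and its own adaptive schedule), not the authors' implementation.

```python
import math
import random

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def toy_fitness(seq):
    # Hypothetical stand-in for a learned fitness model such as Seq2Fitness:
    # here, simply the fraction of a favored residue.
    return seq.count("A") / len(seq)

def biphasic_temperature(step, n_steps, t_high=2.0, t_low=0.05):
    # Phase 1: cool from t_high to t_low (exploitation).
    # Phase 2: reheat toward t_high to restore diversity and
    # avoid premature convergence.
    half = n_steps // 2
    if step < half:
        frac = step / half
        return t_high + (t_low - t_high) * frac
    frac = (step - half) / max(1, n_steps - half)
    return t_low + (t_high - t_low) * frac

def anneal(wildtype, n_steps=200, seed=0):
    rng = random.Random(seed)
    seq = wildtype
    best = (toy_fitness(seq), seq)
    for step in range(n_steps):
        t = biphasic_temperature(step, n_steps)
        # Propose a single-point mutation.
        pos = rng.randrange(len(seq))
        cand = seq[:pos] + rng.choice(AMINO_ACIDS) + seq[pos + 1:]
        delta = toy_fitness(cand) - toy_fitness(seq)
        # Metropolis acceptance at the current temperature: uphill moves
        # always accepted, downhill moves accepted more often when t is high.
        if delta >= 0 or rng.random() < math.exp(delta / t):
            seq = cand
        if toy_fitness(seq) > best[0]:
            best = (toy_fitness(seq), seq)
    return best

fitness, seq = anneal("MKTAYIAKQR")
```

The reheating phase is what distinguishes a biphasic schedule from standard simulated annealing: after the sampler concentrates on high-fitness sequences, raising the temperature again lets it escape that neighborhood and collect a more diverse set of strong candidates.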
format Article
id doaj-art-b368d553fcaa4d00b0aa366b840215bf
institution OA Journals
issn 1553-734X
1553-7358
language English
publishDate 2025-06-01
publisher Public Library of Science (PLoS)
record_format Article
series PLoS Computational Biology
spelling doaj-art-b368d553fcaa4d00b0aa366b840215bf | 2025-08-20T02:36:02Z | eng | Public Library of Science (PLoS) | PLoS Computational Biology | 1553-734X, 1553-7358 | 2025-06-01 | Vol. 21, Iss. 6, e1013119 | 10.1371/journal.pcbi.1013119 | Designing diverse and high-performance proteins with a large language model in the loop. | Carlos A Gomez-Uribe; Japheth Gado; Meiirbek Islamov | https://doi.org/10.1371/journal.pcbi.1013119
spellingShingle Carlos A Gomez-Uribe
Japheth Gado
Meiirbek Islamov
Designing diverse and high-performance proteins with a large language model in the loop.
PLoS Computational Biology
title Designing diverse and high-performance proteins with a large language model in the loop.
title_full Designing diverse and high-performance proteins with a large language model in the loop.
title_fullStr Designing diverse and high-performance proteins with a large language model in the loop.
title_full_unstemmed Designing diverse and high-performance proteins with a large language model in the loop.
title_short Designing diverse and high-performance proteins with a large language model in the loop.
title_sort designing diverse and high performance proteins with a large language model in the loop
url https://doi.org/10.1371/journal.pcbi.1013119
work_keys_str_mv AT carlosagomezuribe designingdiverseandhighperformanceproteinswithalargelanguagemodelintheloop
AT japhethgado designingdiverseandhighperformanceproteinswithalargelanguagemodelintheloop
AT meiirbekislamov designingdiverseandhighperformanceproteinswithalargelanguagemodelintheloop