Data-driven protease engineering by DNA-recording and epistasis-aware machine learning

Abstract Protein engineering has recently seen tremendous transformation due to machine learning (ML) tools that predict structure from sequence at unprecedented precision. Predicting catalytic activity, however, remains challenging, restricting our capabilities to design protein sequences with desi...

Bibliographic Details
Main Authors: Lukas Huber, Tim Kucera, Simon Höllerer, Karsten Borgwardt, Sven Panke, Markus Jeschek
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Nature Communications
Online Access: https://doi.org/10.1038/s41467-025-60622-7
collection DOAJ
description Abstract: Protein engineering has recently been transformed by machine learning (ML) tools that predict structure from sequence with unprecedented precision. Predicting catalytic activity, however, remains challenging, restricting our ability to design protein sequences with desired catalytic function in silico. This predicament is mainly rooted in a lack of experimental methods capable of recording sequence-activity data in quantities sufficient for data-intensive ML techniques, and in the inefficiency of searches in the enormous sequence spaces inherent to proteins. Herein, we address both limitations in the context of engineering proteases with tailored substrate specificity. We introduce a DNA recorder for deep specificity profiling of proteases in Escherichia coli, which we demonstrate by testing 29,716 candidate proteases against up to 134 substrates in parallel. The resulting sequence-activity data on approximately 600,000 protease-substrate pairs not only reveals key sequence determinants governing protease specificity but also allows us to build a data-efficient deep learning model that accurately predicts protease sequences with desired on- and off-target activities. Moreover, we present epistasis-aware training set design as a generalizable strategy to streamline searches within enormous sequence spaces, which strongly increases model accuracy for a given experimental effort and is thus likely to have implications for protein engineering far beyond proteases.
format Article
id doaj-art-299c68abd9f548babd155b17be4dfe82
institution Kabale University
issn 2041-1723
language English
publishDate 2025-07-01
publisher Nature Portfolio
record_format Article
series Nature Communications
spelling doaj-art-299c68abd9f548babd155b17be4dfe82 2025-08-20T04:01:35Z eng Nature Portfolio Nature Communications 2041-1723 2025-07-01 16 1 1 15 10.1038/s41467-025-60622-7
Data-driven protease engineering by DNA-recording and epistasis-aware machine learning
Lukas Huber, Tim Kucera, Simon Höllerer, Karsten Borgwardt, Sven Panke, Markus Jeschek (all: Department of Biosystems Science and Engineering, ETH Zurich)
https://doi.org/10.1038/s41467-025-60622-7