Benchmarking protein language models for protein crystallization


Bibliographic Details
Main Authors: Raghvendra Mall, Rahul Kaushik, Zachary A. Martinez, Matt W. Thomson, Filippo Castiglione
Format: Article
Language:English
Published: Nature Portfolio 2025-01-01
Series:Scientific Reports
Subjects:
Online Access:https://doi.org/10.1038/s41598-025-86519-5
_version_ 1832594745220661248
author Raghvendra Mall
Rahul Kaushik
Zachary A. Martinez
Matt W. Thomson
Filippo Castiglione
author_facet Raghvendra Mall
Rahul Kaushik
Zachary A. Martinez
Matt W. Thomson
Filippo Castiglione
author_sort Raghvendra Mall
collection DOAJ
description Abstract The problem of protein structure determination is usually solved by X-ray crystallography. Several in silico deep learning methods have been developed to predict the crystallization propensities of proteins from their sequences, in order to overcome the high attrition rate, experimental cost, and extensive trial-and-error settings of crystallization experiments. In this work, we benchmark the power of open protein language models (PLMs) through the TRILL platform, a bespoke framework that democratizes the use of PLMs for predicting the crystallization propensities of proteins. By comparing LightGBM and XGBoost classifiers built on the average embedding representations of proteins learned by different PLMs (ESM2, Ankh, ProtT5-XL, ProstT5, xTrimoPGLM, and SaProt) against state-of-the-art sequence-based methods such as DeepCrystal, ATTCrys, and CLPred, we identify the most effective methods for predicting crystallization outcomes. The LightGBM classifiers using embeddings from the ESM2 models with 30 and 36 transformer layers (150 million and 3 billion parameters, respectively) achieve performance gains of 3-5% over all compared models across various evaluation metrics, including AUPR (Area Under the Precision-Recall Curve), AUC (Area Under the Receiver Operating Characteristic Curve), and F1 on independent test sets. Furthermore, we fine-tune the ProtGPT2 model available via TRILL to generate crystallizable proteins. Starting from 3000 generated proteins and applying a series of filtration steps, including consensus of all open PLM-based classifiers, sequence-identity filtering with CD-HIT, secondary-structure compatibility, aggregation screening, homology search, and foldability evaluation, we identified a set of 5 novel proteins as potentially crystallizable.
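The abstract describes classifiers built on the "average embedding representations" of proteins learned by PLMs. As a minimal illustration of that pooling step (not the authors' code; `mean_pool_embedding` is a hypothetical name, and real inputs would be per-residue embedding matrices produced by a PLM such as ESM2), the averaging can be sketched as:

```python
import numpy as np

def mean_pool_embedding(per_residue_embeddings):
    """Collapse a (sequence_length, embedding_dim) matrix of per-residue
    PLM embeddings into one fixed-length protein-level vector by
    averaging over the residue axis. This fixed-length vector is the
    kind of representation a gradient-boosted classifier (e.g. LightGBM)
    could then be trained on for crystallization propensity."""
    emb = np.asarray(per_residue_embeddings, dtype=float)
    return emb.mean(axis=0)

# Toy example: a hypothetical 5-residue protein embedded in 8 dimensions.
toy = np.random.default_rng(0).normal(size=(5, 8))
protein_vector = mean_pool_embedding(toy)
print(protein_vector.shape)  # (8,)
```

The resulting vectors, one per protein, would form the feature matrix for a downstream binary classifier; the specific models and hyperparameters are those of the benchmarked paper, not shown here.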
format Article
id doaj-art-227fc1d18f624ba9b93e6383d0e464f8
institution Kabale University
issn 2045-2322
language English
publishDate 2025-01-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-227fc1d18f624ba9b93e6383d0e464f8 2025-01-19T12:22:03Z eng Nature Portfolio Scientific Reports 2045-2322 2025-01-01 15 1 1 17 10.1038/s41598-025-86519-5 Benchmarking protein language models for protein crystallization Raghvendra Mall (Biotechnology Research Center, Technology Innovation Institute) Rahul Kaushik (Biotechnology Research Center, Technology Innovation Institute) Zachary A. Martinez (Division of Biology and Bioengineering, California Institute of Technology) Matt W. Thomson (Division of Biology and Bioengineering, California Institute of Technology) Filippo Castiglione (Biotechnology Research Center, Technology Innovation Institute) Abstract The problem of protein structure determination is usually solved by X-ray crystallography. Several in silico deep learning methods have been developed to predict the crystallization propensities of proteins from their sequences, in order to overcome the high attrition rate, experimental cost, and extensive trial-and-error settings of crystallization experiments. In this work, we benchmark the power of open protein language models (PLMs) through the TRILL platform, a bespoke framework that democratizes the use of PLMs for predicting the crystallization propensities of proteins. By comparing LightGBM and XGBoost classifiers built on the average embedding representations of proteins learned by different PLMs (ESM2, Ankh, ProtT5-XL, ProstT5, xTrimoPGLM, and SaProt) against state-of-the-art sequence-based methods such as DeepCrystal, ATTCrys, and CLPred, we identify the most effective methods for predicting crystallization outcomes. The LightGBM classifiers using embeddings from the ESM2 models with 30 and 36 transformer layers (150 million and 3 billion parameters, respectively) achieve performance gains of 3-5% over all compared models across various evaluation metrics, including AUPR (Area Under the Precision-Recall Curve), AUC (Area Under the Receiver Operating Characteristic Curve), and F1 on independent test sets.
Furthermore, we fine-tune the ProtGPT2 model available via TRILL to generate crystallizable proteins. Starting from 3000 generated proteins and applying a series of filtration steps, including consensus of all open PLM-based classifiers, sequence-identity filtering with CD-HIT, secondary-structure compatibility, aggregation screening, homology search, and foldability evaluation, we identified a set of 5 novel proteins as potentially crystallizable. https://doi.org/10.1038/s41598-025-86519-5 Open protein language models (PLMs) Protein crystallization Benchmarking Protein generation
spellingShingle Raghvendra Mall
Rahul Kaushik
Zachary A. Martinez
Matt W. Thomson
Filippo Castiglione
Benchmarking protein language models for protein crystallization
Scientific Reports
Open protein language models (PLMs)
Protein crystallization
Benchmarking
Protein generation
title Benchmarking protein language models for protein crystallization
title_full Benchmarking protein language models for protein crystallization
title_fullStr Benchmarking protein language models for protein crystallization
title_full_unstemmed Benchmarking protein language models for protein crystallization
title_short Benchmarking protein language models for protein crystallization
title_sort benchmarking protein language models for protein crystallization
topic Open protein language models (PLMs)
Protein crystallization
Benchmarking
Protein generation
url https://doi.org/10.1038/s41598-025-86519-5
work_keys_str_mv AT raghvendramall benchmarkingproteinlanguagemodelsforproteincrystallization
AT rahulkaushik benchmarkingproteinlanguagemodelsforproteincrystallization
AT zacharyamartinez benchmarkingproteinlanguagemodelsforproteincrystallization
AT mattwthomson benchmarkingproteinlanguagemodelsforproteincrystallization
AT filippocastiglione benchmarkingproteinlanguagemodelsforproteincrystallization