Medium-sized protein language models perform well at transfer learning on realistic datasets
Abstract: Protein language models (pLMs) can offer deep insights into evolutionary and structural properties of proteins. While larger models, such as the 15 billion parameter model ESM-2, promise to capture more complex patterns in sequence space, they also present practical challenges due to their high dimensionality and high computational cost. We systematically evaluated the performance of various ESM-style models across multiple biological datasets to assess the impact of model size on transfer learning via feature extraction. Surprisingly, we found that larger models do not necessarily outperform smaller ones, in particular when data is limited. Medium-sized models, such as ESM-2 650M and ESM C 600M, demonstrated consistently good performance, falling only slightly behind their larger counterparts (ESM-2 15B and ESM C 6B) despite being many times smaller. Additionally, we compared various methods of compressing embeddings prior to transfer learning, and we found that mean embeddings consistently outperformed other compression methods. In summary, ESM C 600M with mean embeddings offers an optimal balance between performance and efficiency, making it a practical and scalable choice for transfer learning in realistic biological applications.
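The approach the abstract describes, transfer learning via feature extraction, amounts to pulling per-residue embeddings from a frozen pLM, compressing them to a fixed-length vector (mean pooling performed best in this study), and training a small supervised model on top. The sketch below illustrates that general recipe only; the Hugging Face checkpoint name, the toy sequences and labels, and the Ridge regression head are illustrative assumptions, not the authors' actual pipeline.

```python
# Minimal sketch of mean-embedding feature extraction for transfer learning.
# Assumptions (not from the paper): the facebook/esm2_t33_650M_UR50D checkpoint,
# the toy sequences/labels, and the Ridge regression head.
import torch
from transformers import AutoModel, AutoTokenizer
from sklearn.linear_model import Ridge

checkpoint = "facebook/esm2_t33_650M_UR50D"   # medium-sized ESM-2 (650M parameters)
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModel.from_pretrained(checkpoint).eval()

def mean_embed(sequences):
    """Return one fixed-length vector per sequence: the mean over residue embeddings."""
    batch = tokenizer(sequences, return_tensors="pt", padding=True)
    with torch.no_grad():
        hidden = model(**batch).last_hidden_state          # (batch, length, dim)
    mask = batch["attention_mask"].unsqueeze(-1).float()   # (batch, length, 1)
    return (hidden * mask).sum(dim=1) / mask.sum(dim=1)    # (batch, dim) mean embedding

# Frozen embeddings + a small supervised head = transfer learning via feature extraction.
X = mean_embed(["MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ",
                "MEEPQSDPSVEPPLSQETFSDLWKLLPENNV"]).numpy()
y = [0.7, 1.2]                      # hypothetical assay values for the two toy sequences
head = Ridge(alpha=1.0).fit(X, y)
```

The same mean-pooling step carries over to other ESM-style encoders, including ESM C 600M, the model the abstract recommends; only the model and tokenizer loading would change.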
| Main Authors: | Luiz C. Vieira, Morgan L. Handojo, Claus O. Wilke |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Scientific Reports |
| Subjects: | ESM; Transfer learning; pLM embeddings; Embeddings compression |
| Online Access: | https://doi.org/10.1038/s41598-025-05674-x |
| collection | DOAJ |
|---|---|
| id | doaj-art-e37866cd0ae047b995a5290162a39bf9 |
| institution | Kabale University |
| issn | 2045-2322 |
| publisher | Nature Portfolio |
| publishDate | 2025-07-01 |
| format | Article |
| language | English |
| series | Scientific Reports |
| volume / issue / pages | 15 / 1 / 1–13 |
| author affiliations | Department of Integrative Biology, The University of Texas at Austin (all three authors) |
| topic | ESM; Transfer learning; pLM embeddings; Embeddings compression |
| url | https://doi.org/10.1038/s41598-025-05674-x |