Environmental adaptations in metagenomes revealed by deep learning

Abstract Background Deep learning has emerged as a powerful tool in the analysis of biological data, including the analysis of large metagenome data. However, its application remains limited due to high computational costs, model complexity, and difficulty extracting biological insights from these a...

Bibliographic Details
Main Authors: Johanna C. Winder, Simon Poulton, Taoyang Wu, Thomas Mock, Cock van Oosterhout
Format: Article
Language: English
Published: BMC 2025-08-01
Series: BMC Biology
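Subjects: Deep learning; Transfer learning; Artificial neural networks; Metagenomics; Domain of unknown function 3494; Ice-binding proteins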
Online Access: https://doi.org/10.1186/s12915-025-02361-1
collection DOAJ
description Abstract
Background: Deep learning has emerged as a powerful tool in the analysis of biological data, including the analysis of large metagenome data. However, its application remains limited due to high computational costs, model complexity, and difficulty extracting biological insights from these artificial neural networks (ANNs). In this study, we applied a transfer learning approach using the ESM-2 protein structure prediction model and our own smaller ANN to classify proteins containing the domain of unknown function 3494 (DUF3494) by their source environments. DUF3494 is found in a diverse group of putative ice-binding and substrate-binding proteins across a range of environments in prokaryotic and eukaryotic microorganisms. These proteins present a compelling test case for exploring the balance between prediction accuracy and interpretability in sequence classification.
Results: Our ANN analysed 50,669 DUF3494 sequences from publicly available metagenomes, and successfully classified a large proportion of sequences by source environment (polar marine, glacier ice, frozen sediment, rock, subsurface). We identified environment-specific features that appear to drive classification. Our best-performing ANN was able to classify between 75.9 and 97.8% of sequences correctly. To enhance biological interpretability of these predictions, we compared this model with a genetic algorithm (GA), which, although it had lower predictive ability, provided transparent classification rules and predictors. Further in silico mutagenesis of key residues uncovered a vertically aligned column of amino acids on the b-face of the protein that was important for environmental differentiation, suggesting that both methods captured distinct evolutionary and ecological aspects of the sequences. Feature importance analysis identified that steric and electronic properties of the protein were associated with predictive ability.
Conclusions: Our findings highlight the utility of deep learning for classification of diverse biological sequences and provide a framework for combining methods to improve model interpretability and ecological insights.
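The in silico mutagenesis step mentioned in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation, which this record does not describe. It substitutes every residue of a sequence with each alternative amino acid and records how far a scoring function moves from its baseline. The names `single_mutants`, `mutagenesis_scan`, `toy_score` and `KEY_POSITIONS`, the example sequence, and the scoring rule are all hypothetical; in practice the score would come from a trained classifier such as the ANN described above.

```python
from typing import Callable, Dict, List, Tuple

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

# Hypothetical positions that the toy scorer treats as important (assumption).
KEY_POSITIONS = (1, 4)


def single_mutants(seq: str) -> List[Tuple[int, str, str]]:
    """Return (position, new_residue, mutated_sequence) for every single substitution."""
    mutants = []
    for pos, original in enumerate(seq):
        for aa in AMINO_ACIDS:
            if aa != original:
                mutants.append((pos, aa, seq[:pos] + aa + seq[pos + 1:]))
    return mutants


def mutagenesis_scan(seq: str, score: Callable[[str], float]) -> Dict[int, float]:
    """Map each position to the largest absolute score change over all substitutions."""
    baseline = score(seq)
    effect: Dict[int, float] = {pos: 0.0 for pos in range(len(seq))}
    for pos, _aa, mutant in single_mutants(seq):
        effect[pos] = max(effect[pos], abs(score(mutant) - baseline))
    return effect


def toy_score(seq: str) -> float:
    """Stand-in for a classifier output: fraction of KEY_POSITIONS holding threonine."""
    hits = sum(1 for p in KEY_POSITIONS if p < len(seq) and seq[p] == "T")
    return hits / len(KEY_POSITIONS)


if __name__ == "__main__":
    effects = mutagenesis_scan("MTASTG", toy_score)
    sensitive = sorted(p for p, e in effects.items() if e > 0)
    print(sensitive)  # prints [1, 4]
```

Positions whose substitutions shift the score most are candidates for residues that drive classification. With a real model, `toy_score` would be replaced by the predicted probability of a given source environment.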
id doaj-art-4b8c284ca04a432eb892f3d4d4c3d811
institution Kabale University
issn 1741-7007
spelling Johanna C. Winder (School of Environmental Sciences, University of East Anglia)
Simon Poulton (School of Biological Sciences, University of East Anglia)
Taoyang Wu (School of Computing Sciences, University of East Anglia)
Thomas Mock (School of Environmental Sciences, University of East Anglia)
Cock van Oosterhout (School of Environmental Sciences, University of East Anglia)
BMC Biology, vol. 23, no. 1, pp. 1-20 (2025-08-01)
topic Deep learning
Transfer learning
Artificial neural networks
Metagenomics
Domain of unknown function 3494
Ice-binding proteins