NetStart 2.0: prediction of eukaryotic translation initiation sites using a protein language model

Abstract. Background: Accurate identification of translation initiation sites is essential for the proper translation of mRNA into functional proteins. In eukaryotes, the choice of the translation initiation site is influenced by multiple factors, including its proximity to the 5′ end and...

Bibliographic Details
Main Authors: Line Sandvad Nielsen, Anders Gorm Pedersen, Ole Winther, Henrik Nielsen
Format: Article
Language: English
Published: BMC 2025-08-01
Series: BMC Bioinformatics
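Subjects: Protein language models; Translation initiation sites; Start codons; Deep learning; “Protein-ness”; Coding potential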
Online Access: https://doi.org/10.1186/s12859-025-06220-2
author Line Sandvad Nielsen
Anders Gorm Pedersen
Ole Winther
Henrik Nielsen
collection DOAJ
description Background: Accurate identification of translation initiation sites is essential for the proper translation of mRNA into functional proteins. In eukaryotes, the choice of the translation initiation site is influenced by multiple factors, including its proximity to the 5′ end and the local start codon context. Translation initiation sites mark the transition from non-coding to coding regions. This fact motivates the expectation that the upstream sequence, if translated, would assemble a nonsensical order of amino acids, while the downstream sequence would correspond to the structured beginning of a protein. This distinction suggests potential for predicting translation initiation sites using a protein language model. Results: We present NetStart 2.0, a deep learning-based model that integrates the ESM-2 protein language model with the local sequence context to predict translation initiation sites across a broad range of eukaryotic species. NetStart 2.0 was trained as a single model across multiple species, and despite the broad phylogenetic diversity represented in the training data, it consistently relied on features marking the transition from non-coding to coding regions. Conclusion: By leveraging “protein-ness”, NetStart 2.0 achieves state-of-the-art performance in predicting translation initiation sites across a diverse range of eukaryotic species. This success underscores the potential of protein language models to bridge transcript- and peptide-level information in complex biological prediction tasks. The NetStart 2.0 webserver is available at: https://services.healthtech.dtu.dk/services/NetStart-2.0/
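The contrast drawn in the description (an upstream sequence that translates to nonsense versus a downstream sequence that reads like the start of a protein) can be illustrated with a minimal Python sketch. This is not the NetStart 2.0 implementation: the function names `translate` and `flanking_peptides`, the `n_codons` window size, and the use of the standard genetic code are assumptions made for illustration only. The sketch merely produces the two in-frame peptides around a candidate ATG; scoring them with a protein language model such as ESM-2 is left out.

```python
from itertools import product

# Standard genetic code, codons enumerated in TCAG order (assumption: nuclear code).
BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO_ACIDS)}


def translate(dna):
    """Translate complete codons only; codons with unknown bases become 'X'."""
    dna = dna.upper().replace("U", "T")
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def flanking_peptides(transcript, atg_pos, n_codons=10):
    """Return (upstream, downstream) peptides in the frame of a candidate ATG.

    atg_pos is the 0-based index of the 'A' of the candidate start codon.
    """
    seq = transcript.upper().replace("U", "T")
    if seq[atg_pos:atg_pos + 3] != "ATG":
        raise ValueError("no ATG at the given position")
    # Upstream window: whole codons in the ATG's frame, ending just before the ATG.
    start = max(atg_pos - 3 * n_codons, atg_pos % 3)
    upstream = translate(seq[start:atg_pos])
    # Downstream window: starts at the ATG itself.
    downstream = translate(seq[atg_pos:atg_pos + 3 * n_codons])
    return upstream, downstream
```

For example, `flanking_peptides("GCTTAAATGGCTAAA", 6)` yields an upstream peptide interrupted by a stop codon and a downstream peptide beginning with methionine, which is the kind of difference a protein language model could in principle pick up.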
format Article
id doaj-art-07f2ff9fadb54390a9b54bca9529dd0e
institution Kabale University
issn 1471-2105
language English
publishDate 2025-08-01
publisher BMC
record_format Article
series BMC Bioinformatics
spelling doaj-art-07f2ff9fadb54390a9b54bca9529dd0e; 2025-08-24T11:54:34Z; eng; BMC; BMC Bioinformatics; 1471-2105; 2025-08-01; volume 26, issue 1, pages 1-22; 10.1186/s12859-025-06220-2
affiliation Line Sandvad Nielsen: Section for Computational and RNA Biology, Department of Biology, University of Copenhagen
Anders Gorm Pedersen: Section for Bioinformatics, Department of Health Technology, Technical University of Denmark
Ole Winther: Section for Computational and RNA Biology, Department of Biology, University of Copenhagen
Henrik Nielsen: Section for Bioinformatics, Department of Health Technology, Technical University of Denmark
title NetStart 2.0: prediction of eukaryotic translation initiation sites using a protein language model
topic Protein language models
Translation initiation sites
Start codons
Deep learning
“Protein-ness”
Coding potential
url https://doi.org/10.1186/s12859-025-06220-2