Phylogenetic tree-based amino acid sequence generation for proteomics data analysis of unknown species

In bottom-up proteomics, selecting an appropriate protein amino acid sequence database is vital for reliable peptide identification. However, database-dependent identification excludes species with unsequenced genomes, which limits the comprehensiveness of the analysis. This is a major challenge in current microbiota proteomics, a rapidly...

Bibliographic Details
Main Authors: Nobuaki Miura, Tsuyoshi Tabata, Yasushi Ishihama, Shujiro Okuda
Format: Article
Language: English
Published: Elsevier 2025-01-01
Series: Computational and Structural Biotechnology Journal
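Subjects: Amino acid sequence generation; Proteomics data analysis; Peptide identification; Spectral matching; Random branch; Ion Cover Score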
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037025002041
author Nobuaki Miura
Tsuyoshi Tabata
Yasushi Ishihama
Shujiro Okuda
collection DOAJ
description In bottom-up proteomics, selecting an appropriate protein amino acid sequence database is vital for reliable peptide identification. However, database-dependent identification excludes species with unsequenced genomes, which limits the comprehensiveness of the analysis. This is a major challenge in microbiota proteomics, a rapidly developing field in which proteins in a sample are simultaneously assigned to species and analyzed using databases of protein amino acid sequences from species with known genomes. We aimed to develop a method that extends the species diversity of the database by generating protein amino acid sequences of unknown species from the phylogenetic relationships among known species. To evaluate this approach, we generated the Helicobacter pylori F16 strain sequence based on the phylogenetic relationships of 29 closely related strains (excluding F16). With sequence generation, the percentage of peptides that matched the peptides obtained from the reference F16 strain increased by 5%. Proteomics data analyses were then performed on the F16 strain using the generated sequence database to validate peptide identification. The number of peptide-spectrum matches decreased when the database was expanded using sequence generation, owing to a decrease in sensitivity primarily caused by an increase in decoy hits. This decrease in identification sensitivity caused by large-scale databases could be mitigated by introducing a novel score, the Ion Cover Score, based on spectral matching. The sequence generation method used in the present study and the introduction of scores based on spectral matching could accelerate proteomics development.
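The evaluation described above reports the percentage of peptides that match peptides from the reference F16 strain. The record does not say how peptides were derived or compared, so the following is only a minimal sketch under stated assumptions: an in silico tryptic digest (cleave after K or R unless followed by P, no missed cleavages), exact string matching, and toy sequences that are hypothetical rather than real H. pylori proteins. The function names `tryptic_peptides` and `percent_matched` are illustrative and do not come from the article.

```python
def tryptic_peptides(protein, min_len=6):
    """Return the set of fully cleaved tryptic peptides of a protein.

    Assumption: cleave C-terminal to K or R unless the next residue is P,
    with no missed cleavages. Peptides shorter than min_len are dropped.
    """
    peptides = set()
    start = 0
    for i, aa in enumerate(protein):
        nxt = protein[i + 1] if i + 1 < len(protein) else ""
        if aa in "KR" and nxt != "P":
            if i + 1 - start >= min_len:
                peptides.add(protein[start:i + 1])
            start = i + 1
    # Keep the C-terminal remainder if the protein does not end in K/R.
    if len(protein) - start >= min_len:
        peptides.add(protein[start:])
    return peptides


def percent_matched(reference_proteins, database_proteins, min_len=6):
    """Percentage of reference peptides that also occur in the database."""
    ref = set().union(*(tryptic_peptides(p, min_len) for p in reference_proteins))
    db = set().union(*(tryptic_peptides(p, min_len) for p in database_proteins))
    if not ref:
        return 0.0
    return 100.0 * len(ref & db) / len(ref)


# Hypothetical toy data: one "reference" protein, one known relative with a
# single substitution, and one generated variant.
reference = ["MKTAYIAKQRQISFVKSHFSR"]
known = ["MKTAYIAKQRQLSFVKSHFSR"]
generated = ["MKTAYIAKQRQISFVKSHFSR"]

if __name__ == "__main__":
    print(percent_matched(reference, known, min_len=5))
    print(percent_matched(reference, known + generated, min_len=5))
```

Comparing the two printed values shows how adding generated sequences to the database can raise peptide coverage of the reference. It does not model the drop in sensitivity from additional decoy hits that the abstract also reports.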
format Article
id doaj-art-bbc19f757012410f87ab898091281165
institution DOAJ
issn 2001-0370
language English
publishDate 2025-01-01
publisher Elsevier
record_format Article
series Computational and Structural Biotechnology Journal
spelling doaj-art-bbc19f757012410f87ab898091281165 2025-08-20T03:19:56Z eng
Elsevier, Computational and Structural Biotechnology Journal, ISSN 2001-0370, 2025-01-01, vol. 27, pp. 2313-2322
DOI: 10.1016/j.csbj.2025.05.041
Phylogenetic tree-based amino acid sequence generation for proteomics data analysis of unknown species
Nobuaki Miura: Division of Bioinformatics, Niigata University Graduate School of Medical and Dental Sciences, 2-5274 Gakkocho-dori, Chuo-ku, Niigata 951-8514, Japan
Tsuyoshi Tabata: Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto 606-8501, Japan
Yasushi Ishihama: Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto 606-8501, Japan
Shujiro Okuda (corresponding author): Division of Bioinformatics, Niigata University Graduate School of Medical and Dental Sciences, 2-5274 Gakkocho-dori, Chuo-ku, Niigata 951-8514, Japan; Medical AI Center, Niigata University School of Medicine, 2-5274 Gakkocho-dori, Chuo-ku, Niigata 951-8514, Japan
title Phylogenetic tree-based amino acid sequence generation for proteomics data analysis of unknown species
topic Amino acid sequence generation
Proteomics data analysis
Peptide identification
Spectral matching
Random branch
Ion Cover Score
url http://www.sciencedirect.com/science/article/pii/S2001037025002041