Phylogenetic tree-based amino acid sequence generation for proteomics data analysis of unknown species

In bottom-up proteomics, selecting an appropriate protein amino acid sequence database is vital for reliable peptide identification. However, this approach excludes species with unsequenced genomes, limiting the comprehensiveness. This is a major challenge in current microbiota proteomics, a rapidly...

Bibliographic Details
Main Authors:	Nobuaki Miura, Tsuyoshi Tabata, Yasushi Ishihama, Shujiro Okuda
Format:	Article
Language:	English
Published:	Elsevier 2025-01-01
Series:	Computational and Structural Biotechnology Journal
Subjects:	Amino acid sequence generation Proteomics data analysis Peptide identification Spectral matching Random branch Ion Cover Score
Online Access:	http://www.sciencedirect.com/science/article/pii/S2001037025002041

collection	DOAJ
description	In bottom-up proteomics, selecting an appropriate protein amino acid sequence database is vital for reliable peptide identification. However, relying on such databases excludes species with unsequenced genomes, which limits the comprehensiveness of the analysis. This is a major challenge in microbiota proteomics, a rapidly developing field in which proteins in a sample are simultaneously assigned to species and analyzed using amino acid sequence databases derived from known genomes. We aimed to develop a method that extends the species diversity of the database by generating protein amino acid sequences of unknown species from the phylogenetic relationships among known species. To evaluate this approach, we generated the Helicobacter pylori F16 strain sequence from the phylogenetic relationships of 29 closely related strains (excluding F16). With sequence generation, the percentage of peptides that matched the peptides obtained from the reference F16 strain increased by 5%. Proteomics data analyses were then performed on the F16 strain using the generated sequence database to validate peptide identification. Peptide-spectrum matches decreased when the database was expanded by sequence generation, owing to a loss of sensitivity caused primarily by an increase in decoy hits. This loss of identification sensitivity in large-scale databases could be mitigated by introducing a novel score based on spectral matching, the Ion Cover Score. The sequence generation method used in the present study and the introduction of scores based on spectral matching could accelerate proteomics development.
id	doaj-art-bbc19f757012410f87ab898091281165
institution	DOAJ
issn	2001-0370
volume	27
pages	2313-2322
doi	10.1016/j.csbj.2025.05.041
affiliation	Nobuaki Miura: Division of Bioinformatics, Niigata University Graduate School of Medical and Dental Sciences, 2-5274 Gakkocho-dori, Chuo-ku, Niigata 951-8514, Japan
affiliation	Tsuyoshi Tabata: Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto 606-8501, Japan
affiliation	Yasushi Ishihama: Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto 606-8501, Japan
affiliation	Shujiro Okuda (corresponding author): Division of Bioinformatics, Niigata University Graduate School of Medical and Dental Sciences, 2-5274 Gakkocho-dori, Chuo-ku, Niigata 951-8514, Japan; Medical AI Center, Niigata University School of Medicine, 2-5274 Gakkocho-dori, Chuo-ku, Niigata 951-8514, Japan


Similar Items