Phylogenetic tree-based amino acid sequence generation for proteomics data analysis of unknown species
In bottom-up proteomics, selecting an appropriate protein amino acid sequence database is vital for reliable peptide identification. However, this approach excludes species with unsequenced genomes, limiting the comprehensiveness. This is a major challenge in current microbiota proteomics, a rapidly...
| Main Authors: | Nobuaki Miura, Tsuyoshi Tabata, Yasushi Ishihama, Shujiro Okuda |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-01-01 |
| Series: | Computational and Structural Biotechnology Journal |
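| Subjects: | Amino acid sequence generation; Proteomics data analysis; Peptide identification; Spectral matching; Random branch; Ion Cover Score |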
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2001037025002041 |
| _version_ | 1849694865915379712 |
|---|---|
| author | Nobuaki Miura; Tsuyoshi Tabata; Yasushi Ishihama; Shujiro Okuda |
| author_facet | Nobuaki Miura; Tsuyoshi Tabata; Yasushi Ishihama; Shujiro Okuda |
| author_sort | Nobuaki Miura |
| collection | DOAJ |
| description | In bottom-up proteomics, selecting an appropriate protein amino acid sequence database is vital for reliable peptide identification. However, this approach excludes species with unsequenced genomes, limiting the comprehensiveness. This is a major challenge in current microbiota proteomics, a rapidly developing field, which involves simultaneously assigning proteins to species in a sample and analyzing them using databases of protein amino acid sequences with known genomes. We aimed to develop a method to extend the database species diversity by generating protein amino acid sequences of unknown species using phylogenetic relationships among known species. To evaluate this approach, we generated the Helicobacter pylori F16 strain sequence based on the phylogenetic relationships of 29 closely related strains (excluding F16). Consequently, the percentages of peptides that matched the peptides obtained from the reference F16 strain increased by 5 %, based on sequence generation. Proteomics data analyses were performed on the F16 strain using the generated sequence database to validate peptide identification. Peptide spectral match decreased when the database was expanded using sequence generation owing to a decrease in sensitivity primarily caused by an increase in decoy hits. The decrease in identification sensitivity caused by large-scale databases could be improved by introducing a novel score, Ion Cover Score, based on spectral matching. The sequence generation method used in the present study and the introduction of scores based on spectral matching could accelerate proteomics development. |
| format | Article |
| id | doaj-art-bbc19f757012410f87ab898091281165 |
| institution | DOAJ |
| issn | 2001-0370 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Computational and Structural Biotechnology Journal |
| spelling | doaj-art-bbc19f757012410f87ab898091281165; 2025-08-20T03:19:56Z; eng; Elsevier; Computational and Structural Biotechnology Journal; 2001-0370; 2025-01-01; 27; 2313; 2322; 10.1016/j.csbj.2025.05.041; Phylogenetic tree-based amino acid sequence generation for proteomics data analysis of unknown species; Nobuaki Miura [0]; Tsuyoshi Tabata [1]; Yasushi Ishihama [2]; Shujiro Okuda [3]; [0] Division of Bioinformatics, Niigata University Graduate School of Medical and Dental Sciences, 2-5274 Gakkocho-dori, Chuo-ku, Niigata 951-8514, Japan; [1] Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto 606-8501, Japan; [2] Graduate School of Pharmaceutical Sciences, Kyoto University, Kyoto 606-8501, Japan; [3] Division of Bioinformatics, Niigata University Graduate School of Medical and Dental Sciences, 2-5274 Gakkocho-dori, Chuo-ku, Niigata 951-8514, Japan; Medical AI Center, Niigata University School of Medicine, 2-5274 Gakkocho-dori, Chuo-ku, Niigata 951-8514, Japan; Corresponding author at: Division of Bioinformatics, Niigata University Graduate School of Medical and Dental Sciences, 2-5274 Gakkocho-dori, Chuo-ku, Niigata 951-8514, Japan; In bottom-up proteomics, selecting an appropriate protein amino acid sequence database is vital for reliable peptide identification. However, this approach excludes species with unsequenced genomes, limiting the comprehensiveness. This is a major challenge in current microbiota proteomics, a rapidly developing field, which involves simultaneously assigning proteins to species in a sample and analyzing them using databases of protein amino acid sequences with known genomes. We aimed to develop a method to extend the database species diversity by generating protein amino acid sequences of unknown species using phylogenetic relationships among known species. To evaluate this approach, we generated the Helicobacter pylori F16 strain sequence based on the phylogenetic relationships of 29 closely related strains (excluding F16). Consequently, the percentages of peptides that matched the peptides obtained from the reference F16 strain increased by 5 %, based on sequence generation. Proteomics data analyses were performed on the F16 strain using the generated sequence database to validate peptide identification. Peptide spectral match decreased when the database was expanded using sequence generation owing to a decrease in sensitivity primarily caused by an increase in decoy hits. The decrease in identification sensitivity caused by large-scale databases could be improved by introducing a novel score, Ion Cover Score, based on spectral matching. The sequence generation method used in the present study and the introduction of scores based on spectral matching could accelerate proteomics development.; http://www.sciencedirect.com/science/article/pii/S2001037025002041; Amino acid sequence generation; Proteomics data analysis; Peptide identification; Spectral matching; Random branch; Ion Cover Score |
| spellingShingle | Nobuaki Miura; Tsuyoshi Tabata; Yasushi Ishihama; Shujiro Okuda; Phylogenetic tree-based amino acid sequence generation for proteomics data analysis of unknown species; Computational and Structural Biotechnology Journal; Amino acid sequence generation; Proteomics data analysis; Peptide identification; Spectral matching; Random branch; Ion Cover Score |
| title | Phylogenetic tree-based amino acid sequence generation for proteomics data analysis of unknown species |
| title_full | Phylogenetic tree-based amino acid sequence generation for proteomics data analysis of unknown species |
| title_fullStr | Phylogenetic tree-based amino acid sequence generation for proteomics data analysis of unknown species |
| title_full_unstemmed | Phylogenetic tree-based amino acid sequence generation for proteomics data analysis of unknown species |
| title_short | Phylogenetic tree-based amino acid sequence generation for proteomics data analysis of unknown species |
| title_sort | phylogenetic tree based amino acid sequence generation for proteomics data analysis of unknown species |
| topic | Amino acid sequence generation; Proteomics data analysis; Peptide identification; Spectral matching; Random branch; Ion Cover Score |
| url | http://www.sciencedirect.com/science/article/pii/S2001037025002041 |
| work_keys_str_mv | AT nobuakimiura phylogenetictreebasedaminoacidsequencegenerationforproteomicsdataanalysisofunknownspecies AT tsuyoshitabata phylogenetictreebasedaminoacidsequencegenerationforproteomicsdataanalysisofunknownspecies AT yasushiishihama phylogenetictreebasedaminoacidsequencegenerationforproteomicsdataanalysisofunknownspecies AT shujirookuda phylogenetictreebasedaminoacidsequencegenerationforproteomicsdataanalysisofunknownspecies |