A multi-species benchmark for training and validating mass spectrometry proteomics machine learning models

Abstract Training machine learning models for tasks such as de novo sequencing or spectral clustering requires large collections of confidently identified spectra. Here we describe a dataset of 2.8 million high-confidence peptide-spectrum matches derived from nine different species. The dataset is b...

Bibliographic Details
Main Authors: Bo Wen, William Stafford Noble
Format: Article
Language: English
Published: Nature Portfolio 2024-11-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-024-04068-4
collection DOAJ
description Abstract Training machine learning models for tasks such as de novo sequencing or spectral clustering requires large collections of confidently identified spectra. Here we describe a dataset of 2.8 million high-confidence peptide-spectrum matches derived from nine different species. The dataset is based on a previously described benchmark but has been re-processed to ensure consistent data quality and enforce separation of training and test peptides.
id doaj-art-48c4b8cf1b9a4b65bf8b3e7c239efaad
institution DOAJ
issn 2052-4463