A Compressed Sequence Tag Index for Fast Peptide Retrieval and Efficient Storage in Protein Identification Search Engines

Bibliographic Details
Main Authors: Xiaoyu Xie, Yuyue Feng, Piyu Zhou, Di Zhang, Lijin Yao, Haipeng Wang
Format: Article
Language: English
Published: MDPI AG 2025-06-01
Series: Applied Sciences
Subjects: protein identification search engine; sequence tag index; inverted index; compression algorithm
Online Access: https://www.mdpi.com/2076-3417/15/12/6482
collection DOAJ
description Proteins regulate a wide range of cellular processes and are of great biological interest. Protein search engines are crucial tools in proteomics research: they analyze high-throughput tandem mass spectrometry data to identify protein sequences. A core step in these engines is constructing sequence tag indexes and performing rapid retrieval against protein databases. However, as protein sequence data continue to grow, traditional protein search engines face two challenges: the high storage cost of sequence tag indexes and low retrieval efficiency. To address these issues, we propose STIP, a sequence tag index scheme based on an inverted index and compression techniques, and STIP-Search, a peptide retrieval algorithm that uses the index constructed by STIP. STIP uses a greedy algorithm to partition the tag index into blocks, which allows it to generate tag indexes for very large protein databases such as NCBI-nr. Compared with the four mainstream tag index generation algorithms currently used in Open-pFind, MODplus, TIIP, and PIPI2, STIP has the lowest storage and time consumption. It compresses the tag index with delta encoding, index reduction, and dynamic bit-width encoding, reducing the storage cost by 76.2%. Compared with TIIP, currently the algorithm with the lowest time complexity, STIP-Search reduces the time cost of peptide sequence retrieval by 8.94% to 23.31%.
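The description names an inverted index, delta encoding, and dynamic bit-width encoding without giving details. The following is a minimal Python sketch of how those three ideas can fit together in general; it is not the authors' STIP implementation. The function names, the tag length of 3, and the byte layout are all assumptions made for illustration, and the index reduction and greedy block partitioning steps are not shown.

```python
from collections import defaultdict
from itertools import accumulate


def build_tag_index(peptides, tag_len=3):
    """Inverted index: map each tag (substring of length tag_len) to the
    sorted list of IDs of the peptides that contain it. (Hypothetical helper.)"""
    index = defaultdict(set)
    for pid, seq in enumerate(peptides):
        for i in range(len(seq) - tag_len + 1):
            index[seq[i:i + tag_len]].add(pid)
    return {tag: sorted(ids) for tag, ids in index.items()}


def delta_encode(sorted_ids):
    """Keep the first ID, then store only the gaps between consecutive IDs."""
    if not sorted_ids:
        return []
    gaps = [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]
    return [sorted_ids[0]] + gaps


def delta_decode(deltas):
    """Inverse of delta_encode: a running sum restores the original IDs."""
    return list(accumulate(deltas))


def pack(values):
    """Bit-pack non-negative ints using the smallest width (in bits) that
    fits the largest value in this list. Returns (width, count, payload)."""
    width = max(1, max(values, default=0).bit_length())
    bits = 0
    for v in values:
        bits = (bits << width) | v
    nbytes = (width * len(values) + 7) // 8
    return width, len(values), bits.to_bytes(nbytes, "big")


def unpack(width, count, payload):
    """Inverse of pack: read `count` fields of `width` bits each."""
    bits = int.from_bytes(payload, "big")
    mask = (1 << width) - 1
    return [(bits >> (width * (count - 1 - i))) & mask for i in range(count)]


if __name__ == "__main__":
    index = build_tag_index(["PEPTIDE", "TIDEPEP", "PEPSIN"])
    print(index["PEP"])                       # posting list for the tag "PEP"
    width, count, payload = pack(delta_encode([3, 10, 11, 40]))
    print(width, len(payload))                # bits per value, bytes used
    print(delta_decode(unpack(width, count, payload)))
```

In this sketch the width is chosen per posting list, so a list whose gaps are all small is stored with few bits per entry. For the four IDs in the example, the gaps fit in 5 bits each and the payload occupies 3 bytes, where four fixed 32-bit integers would occupy 16.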
id doaj-art-2ddc89bb63ce4649a939b44bc2bef586
institution OA Journals
issn 2076-3417
spelling doaj-art-2ddc89bb63ce4649a939b44bc2bef586
2025-08-20T02:24:17Z
eng
MDPI AG
Applied Sciences, ISSN 2076-3417
2025-06-01, volume 15, issue 12, article 6482
DOI: 10.3390/app15126482
A Compressed Sequence Tag Index for Fast Peptide Retrieval and Efficient Storage in Protein Identification Search Engines
Xiaoyu Xie, Yuyue Feng, Di Zhang, Lijin Yao, Haipeng Wang: School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China
Piyu Zhou: State Key Laboratory of Mathematical Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
https://www.mdpi.com/2076-3417/15/12/6482
topic protein identification search engine
sequence tag index
inverted index
compression algorithm