A Compressed Sequence Tag Index for Fast Peptide Retrieval and Efficient Storage in Protein Identification Search Engines

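Proteins regulate various cellular processes and are of great biological interest. Protein search engines are crucial tools in proteomics research: they analyze high-throughput tandem mass spectrometry data to identify protein sequences. A core step in a protein search engine is constructing sequence tag indexes and rapidly searching protein databases. However, as protein sequence data continue to grow, traditional protein search engines face two challenges: the high storage cost of sequence tag indexes and low retrieval efficiency. To address these issues, we propose STIP, a sequence tag index scheme based on an inverted index and compression techniques. Building on STIP, we design a peptide retrieval algorithm, STIP-Search, which uses the sequence tag index constructed by STIP to retrieve peptide sequences. STIP uses a greedy algorithm to partition the tag index into blocks, which allows it to generate tag indexes for very large protein databases such as NCBI-nr. Compared with the four current mainstream tag index generation algorithms, those used in Open-pFind, MODplus, TIIP, and PIPI2, STIP has the lowest storage and time consumption. It compresses the tag index with delta encoding, index reduction, and dynamic bit width encoding, reducing the storage cost by 76.2%. Compared with TIIP, currently the algorithm with the lowest time complexity, STIP-Search reduces the time cost of peptide sequence retrieval by 8.94% to 23.31%.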
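The description of this record names an inverted index over sequence tags, delta encoding, and dynamic bit width encoding, but gives no implementation detail. The following is a minimal Python sketch of those general techniques only, not the authors' STIP code. The function names (`build_tag_index`, `compress`, `decompress`), the tag length of 3, the use of global offsets as postings, and the choice of one bit width per posting list are all assumptions made for illustration. Index reduction and the greedy block partitioning are not shown.

```python
from collections import defaultdict


def build_tag_index(proteins, tag_len=3):
    """Inverted index: tag (substring of length tag_len) -> ascending global offsets.

    Assumption: a posting is the tag's start offset in the concatenated database.
    """
    index = defaultdict(list)
    base = 0
    for seq in proteins:
        for pos in range(len(seq) - tag_len + 1):
            index[seq[pos:pos + tag_len]].append(base + pos)
        base += len(seq)
    return dict(index)


def compress(postings):
    """Delta-encode a non-empty ascending posting list, then bit-pack the gaps.

    The bit width is chosen per list from its largest gap (an assumed reading
    of "dynamic bit width"; the paper's exact scheme may differ).
    """
    deltas = [postings[0]] + [b - a for a, b in zip(postings, postings[1:])]
    width = max(1, max(deltas).bit_length())
    packed = 0
    for i, d in enumerate(deltas):
        packed |= d << (i * width)
    nbytes = (len(deltas) * width + 7) // 8
    return width, len(deltas), packed.to_bytes(nbytes, "little")


def decompress(width, count, payload):
    """Unpack the gaps and rebuild the original offsets by running sum."""
    packed = int.from_bytes(payload, "little")
    mask = (1 << width) - 1
    out, total = [], 0
    for i in range(count):
        total += (packed >> (i * width)) & mask
        out.append(total)
    return out


index = build_tag_index(["PEPTIDEK", "TIDEPEP"])
width, count, payload = compress(index["PEP"])
restored = decompress(width, count, payload)
```

Because the offsets in a posting list are ascending, the gaps between neighbours are usually much smaller than the offsets themselves, so they fit in fewer bits. That is the general reason delta encoding and narrow bit widths combine well in inverted indexes.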

Bibliographic Details
Main Authors:	Xiaoyu Xie, Yuyue Feng, Piyu Zhou, Di Zhang, Lijin Yao, Haipeng Wang
Format:	Article
Language:	English
Published:	MDPI AG 2025-06-01
Series:	Applied Sciences
Subjects:	protein identification search engine; sequence tag index; inverted index; compression algorithm
Online Access:	https://www.mdpi.com/2076-3417/15/12/6482

_version_	1850157081495076864
author	Xiaoyu Xie; Yuyue Feng; Piyu Zhou; Di Zhang; Lijin Yao; Haipeng Wang
author_facet	Xiaoyu Xie; Yuyue Feng; Piyu Zhou; Di Zhang; Lijin Yao; Haipeng Wang
author_sort	Xiaoyu Xie
collection	DOAJ
description	Proteins regulate various cellular processes and are of great biological interest. Protein search engines are crucial tools in proteomics research: they analyze high-throughput tandem mass spectrometry data to identify protein sequences. A core step in a protein search engine is constructing sequence tag indexes and rapidly searching protein databases. However, as protein sequence data continue to grow, traditional protein search engines face two challenges: the high storage cost of sequence tag indexes and low retrieval efficiency. To address these issues, we propose STIP, a sequence tag index scheme based on an inverted index and compression techniques. Building on STIP, we design a peptide retrieval algorithm, STIP-Search, which uses the sequence tag index constructed by STIP to retrieve peptide sequences. STIP uses a greedy algorithm to partition the tag index into blocks, which allows it to generate tag indexes for very large protein databases such as NCBI-nr. Compared with the four current mainstream tag index generation algorithms, those used in Open-pFind, MODplus, TIIP, and PIPI2, STIP has the lowest storage and time consumption. It compresses the tag index with delta encoding, index reduction, and dynamic bit width encoding, reducing the storage cost by 76.2%. Compared with TIIP, currently the algorithm with the lowest time complexity, STIP-Search reduces the time cost of peptide sequence retrieval by 8.94% to 23.31%.
format	Article
id	doaj-art-2ddc89bb63ce4649a939b44bc2bef586
institution	OA Journals
issn	2076-3417
language	English
publishDate	2025-06-01
publisher	MDPI AG
record_format	Article
series	Applied Sciences
spelling	doaj-art-2ddc89bb63ce4649a939b44bc2bef586; 2025-08-20T02:24:17Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2025-06-01; volume 15, issue 12, article 6482; 10.3390/app15126482; A Compressed Sequence Tag Index for Fast Peptide Retrieval and Efficient Storage in Protein Identification Search Engines; Xiaoyu Xie (0), Yuyue Feng (1), Piyu Zhou (2), Di Zhang (3), Lijin Yao (4), Haipeng Wang (5); affiliations: (0, 1, 3, 4, 5) School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China; (2) State Key Laboratory of Mathematical Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China
title	A Compressed Sequence Tag Index for Fast Peptide Retrieval and Efficient Storage in Protein Identification Search Engines
topic	protein identification search engine; sequence tag index; inverted index; compression algorithm
url	https://www.mdpi.com/2076-3417/15/12/6482

Similar Items