A Compressed Sequence Tag Index for Fast Peptide Retrieval and Efficient Storage in Protein Identification Search Engines
Proteins regulate various cellular processes and are of great biological interest. The protein search engine is a crucial tool in proteomics research, used to analyze high-throughput tandem mass spectrometry data and to identify protein sequence information. A core step in protein search engines is...
| Main Authors: | Xiaoyu Xie, Yuyue Feng, Piyu Zhou, Di Zhang, Lijin Yao, Haipeng Wang |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-06-01 |
| Series: | Applied Sciences |
| Subjects: | protein identification search engine; sequence tag index; inverted index; compression algorithm |
| Online Access: | https://www.mdpi.com/2076-3417/15/12/6482 |
| _version_ | 1850157081495076864 |
|---|---|
| author | Xiaoyu Xie, Yuyue Feng, Piyu Zhou, Di Zhang, Lijin Yao, Haipeng Wang |
| author_sort | Xiaoyu Xie |
| collection | DOAJ |
| description | Proteins regulate various cellular processes and are of great biological interest. The protein search engine is a crucial tool in proteomics research, used to analyze high-throughput tandem mass spectrometry data and to identify protein sequence information. A core step in protein search engines is constructing sequence tag indexes and rapidly searching protein databases. However, as the scale of protein sequence data continues to grow, traditional protein search engines face two challenges: the high storage cost of sequence tag indexes and low retrieval efficiency. To address these issues, we propose a sequence tag index scheme named STIP, which is based on an inverted index and compression techniques. Based on STIP, we design a peptide retrieval algorithm named STIP-Search, which uses the sequence tag index constructed by STIP to retrieve peptide sequences. STIP uses a greedy algorithm to partition the tag index into blocks, which allows it to generate tag indexes for very large protein databases such as NCBI-nr. Compared to the four current mainstream tag index generation algorithms, used in Open-pFind, MODplus, TIIP and PIPI2, STIP has the lowest storage and time consumption. It compresses the tag index with delta encoding, index reduction, and dynamic bit width encoding, reducing the storage cost by 76.2%. Compared to TIIP, currently the algorithm with the lowest time complexity, STIP-Search reduces the time cost of peptide sequence retrieval by 8.94% to 23.31%. |
| format | Article |
| id | doaj-art-2ddc89bb63ce4649a939b44bc2bef586 |
| institution | OA Journals |
| issn | 2076-3417 |
| language | English |
| publishDate | 2025-06-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Applied Sciences |
| spelling | doaj-art-2ddc89bb63ce4649a939b44bc2bef586; 2025-08-20T02:24:17Z; eng; MDPI AG; Applied Sciences; 2076-3417; 2025-06-01; 15(12), 6482; 10.3390/app15126482; A Compressed Sequence Tag Index for Fast Peptide Retrieval and Efficient Storage in Protein Identification Search Engines; Xiaoyu Xie (School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China); Yuyue Feng (School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China); Piyu Zhou (State Key Laboratory of Mathematical Science, Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China); Di Zhang (School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China); Lijin Yao (School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China); Haipeng Wang (School of Computer Science and Technology, Shandong University of Technology, Zibo 255000, China); https://www.mdpi.com/2076-3417/15/12/6482; protein identification search engine; sequence tag index; inverted index; compression algorithm |
| title | A Compressed Sequence Tag Index for Fast Peptide Retrieval and Efficient Storage in Protein Identification Search Engines |
| topic | protein identification search engine; sequence tag index; inverted index; compression algorithm |
| url | https://www.mdpi.com/2076-3417/15/12/6482 |
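The description above names three building blocks: an inverted index over sequence tags, delta encoding, and dynamic bit width encoding. The following is a minimal Python sketch of how such techniques can fit together in general. It is not the authors' STIP implementation: the function names, the tag length of 3, the one-position gap between proteins, and the byte layout are all assumptions made for illustration, and the paper's index reduction and greedy block partitioning are not shown.

```python
from collections import defaultdict


def build_tag_index(proteins, tag_len=3):
    """Inverted index: map each tag of length tag_len to its sorted global positions.

    Hypothetical helper; tag_len=3 is an illustrative choice, not taken from the paper.
    """
    index = defaultdict(list)
    offset = 0
    for seq in proteins:
        for i in range(len(seq) - tag_len + 1):
            index[seq[i:i + tag_len]].append(offset + i)
        offset += len(seq) + 1  # advance past this protein plus a one-position gap
    return dict(index)


def delta_encode(positions):
    """Keep the first position, then store the gaps between consecutive positions."""
    return [positions[0]] + [b - a for a, b in zip(positions, positions[1:])]


def delta_decode(deltas):
    """Invert delta_encode by taking a running sum."""
    out, total = [], 0
    for d in deltas:
        total += d
        out.append(total)
    return out


def pack(values):
    """Pack non-negative ints with the smallest bit width that fits the largest one."""
    width = max(1, max(values).bit_length())
    acc = 0
    for v in values:
        acc = (acc << width) | v
    nbytes = (width * len(values) + 7) // 8
    return width, len(values), acc.to_bytes(nbytes, "big")


def unpack(width, count, data):
    """Invert pack: read count values of width bits each, most significant first."""
    acc = int.from_bytes(data, "big")
    mask = (1 << width) - 1
    return [(acc >> (width * (count - 1 - i))) & mask for i in range(count)]
```

Because positions in a posting list are sorted, the gaps are usually much smaller than the positions themselves, so choosing the bit width per list (or per block) from the largest gap can save space compared with fixed 32-bit integers. A round trip would look like `delta_decode(unpack(*pack(delta_encode(positions))))`.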