A Compressed Sequence Tag Index for Fast Peptide Retrieval and Efficient Storage in Protein Identification Search Engines
| Main Authors: | , , , , , |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2025-06-01 |
| Series: | Applied Sciences |
| Online Access: | https://www.mdpi.com/2076-3417/15/12/6482 |
| Summary: | Proteins regulate various cellular processes and are of great biological interest. The protein search engine is a crucial tool in proteomics research, used to analyze high-throughput tandem mass spectrometry data and to identify protein sequences. A core step in a protein search engine is constructing a sequence tag index and using it to retrieve candidates rapidly from a protein database. However, as protein sequence data continue to grow, traditional protein search engines face two challenges: the high storage cost of sequence tag indexes and low retrieval efficiency. To address these issues, we propose a sequence tag index scheme named STIP, which is based on an inverted index and compression techniques. Based on STIP, we design a peptide retrieval algorithm named STIP-Search, which uses the sequence tag index constructed by STIP to retrieve peptide sequences. STIP uses a greedy algorithm to partition the tag index into blocks, which allows it to generate tag indexes for very large protein databases such as NCBI-nr. Compared with the four mainstream tag index generation algorithms currently in use (those of Open-pFind, MODplus, TIIP, and PIPI2), STIP has the lowest storage and time consumption. STIP compresses the tag index with delta encoding, index reduction, and dynamic bit width encoding, reducing the storage cost by 76.2%. Compared with TIIP, currently the algorithm with the lowest time complexity, STIP-Search reduces the time cost of peptide sequence retrieval by 8.94% to 23.31%. |
| ISSN: | 2076-3417 |
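
The summary names an inverted tag index compressed with delta encoding and dynamic bit width encoding. The Python sketch below illustrates those general techniques on a toy peptide list; it is not the authors' implementation. The function names, the tag length of 3, and the packing layout are assumptions made for illustration, since this record does not describe STIP's actual index layout, its index reduction step, or its greedy block partitioning.

```python
from collections import defaultdict


def build_tag_index(peptides, tag_len=3):
    """Inverted index: map each tag (substring of length tag_len) to the
    sorted IDs of the peptides that contain it. tag_len=3 is an assumption."""
    index = defaultdict(set)
    for pid, seq in enumerate(peptides):
        for i in range(len(seq) - tag_len + 1):
            index[seq[i:i + tag_len]].add(pid)
    return {tag: sorted(ids) for tag, ids in index.items()}


def delta_encode(sorted_ids):
    """Keep the first ID, then store only the gaps between neighbours.
    Expects a non-empty, ascending list (every posting list here is one)."""
    return [sorted_ids[0]] + [b - a for a, b in zip(sorted_ids, sorted_ids[1:])]


def delta_decode(deltas):
    """Invert delta_encode by taking a running sum."""
    ids, total = [], 0
    for d in deltas:
        total += d
        ids.append(total)
    return ids


def pack(values):
    """Pack non-negative ints with the smallest fixed bit width that fits
    the largest value (a simple form of dynamic bit width encoding)."""
    width = max(1, max(values).bit_length())
    packed = 0
    for v in values:
        packed = (packed << width) | v
    return width, len(values), packed


def unpack(width, count, packed):
    """Invert pack: read `count` fields of `width` bits, first value first."""
    mask = (1 << width) - 1
    return [(packed >> (width * (count - 1 - i))) & mask for i in range(count)]


# Toy usage: hypothetical peptides, not data from the paper.
peptides = ["PEPTIDEK", "TIDELR", "AAPEPK", "GGTIDEK"]
index = build_tag_index(peptides)
postings = index["TID"]                 # peptide IDs containing the tag "TID"
compressed = pack(delta_encode(postings))
assert delta_decode(unpack(*compressed)) == postings
```

Because posting lists are sorted, the gaps are usually much smaller than the IDs themselves, so a per-list bit width chosen from the largest gap needs fewer bits than a fixed-width integer per ID. That is the general reason these two encodings are often combined.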