A benchmark study of compression software for human short-read sequence data

Abstract Efficient data compression technologies are crucial to reduce the cost of long-term storage and file transfer in whole genome sequencing studies. This study benchmarked four specialized compression tools developed for paired-end fastq.gz files, namely DRAGEN ORA 4.3.4 (ORA), Genozip 15.0.62, repaq...

Bibliographic Details
Main Authors: Raphael O. Betschart, Felix Thalén, Stefan Blankenberg, Martin Zoche, Tanja Zeller, Andreas Ziegler
Format: Article
Language: English
Published: Nature Portfolio 2025-05-01
Series: Scientific Reports
Subjects: BAM, CRAM, DNA sequencing, FASTQ, GVCF, Illumina sequencing
Online Access: https://doi.org/10.1038/s41598-025-00491-8
author Raphael O. Betschart
Felix Thalén
Stefan Blankenberg
Martin Zoche
Tanja Zeller
Andreas Ziegler
collection DOAJ
description Abstract Efficient data compression technologies are crucial to reduce the cost of long-term storage and file transfer in whole genome sequencing studies. This study benchmarked four specialized compression tools developed for paired-end fastq.gz files, namely DRAGEN ORA 4.3.4 (ORA), Genozip 15.0.62, repaq 0.3.0, and SPRING 1.1.1, using three subjects from the Genome in a Bottle Consortium that were sequenced 82 times on an Illumina NovaSeq 6000, with an average coverage of 35x. It additionally compared Genozip with SAMtools 1.20 for the compression of BAM files. All tools provided lossless compression. ORA and Genozip achieved compression ratios of approximately 1:6 when compressing fastq.gz files. repaq and SPRING had lower compression ratios of 1:2 and 1:4, respectively, and took longer for both compression and decompression than ORA and Genozip. Genozip had approximately 16% higher compression for BAM files than SAMtools. However, the BAM compression of SAMtools produces CRAM files, which are compatible with many software packages. ORA, repaq, and SPRING are limited to compressing fastq.gz files, while Genozip supports various file formats. Although Genozip requires an annual license, its source code is freely available, ensuring sustainability. In conclusion, paired-end short-read sequence data can be efficiently compressed using specialized compression software. Commercial tools offer higher compression ratios than freely available software.
format Article
id doaj-art-a79cd930e56045d290bf0e7c2b7c172d
institution DOAJ
issn 2045-2322
language English
publishDate 2025-05-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-a79cd930e56045d290bf0e7c2b7c172d
2025-08-20T02:55:31Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-05-01
volume 15, issue 1, pages 1-17
10.1038/s41598-025-00491-8
Raphael O. Betschart (Cardio-CARE)
Felix Thalén (Cardio-CARE)
Stefan Blankenberg (Cardio-CARE)
Martin Zoche (Institute of Pathology and Molecular Pathology, University Hospital Zurich)
Tanja Zeller (Institute of Cardiogenetics, University of Lübeck)
Andreas Ziegler (Cardio-CARE)
title A benchmark study of compression software for human short-read sequence data
topic BAM
CRAM
DNA sequencing
FASTQ
GVCF
Illumina sequencing
url https://doi.org/10.1038/s41598-025-00491-8