Benchmarking of bioinformatics tools for the hybrid de novo assembly of human and non-human whole-genome sequencing data
Accurate and complete de novo genome assemblies enable variant identification and the discovery of novel genomic features and biological functions. However, de novo assemblies of large and complex genomes remain challenging. Long-read sequencing data, alone or combined with short-read data, facilita...
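| Main Authors: | Adrián Muñoz-Barrera, Luis A. Rubio-Rodríguez, David Jáspez, Almudena Corrales, Itahisa Marcelino-Rodriguez, Lourdes Ortiz, Pablo Mendoza, José M. Lorenzo-Salazar, Rafaela González-Montelongo, Carlos Flores |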
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Elsevier, 2025-01-01 |
| Series: | Computational and Structural Biotechnology Journal |
| Subjects: | Long-read sequencing; Nanopore; WGS; De novo genome assembly |
| Online Access: | http://www.sciencedirect.com/science/article/pii/S2001037025002867 |
| _version_ | 1849720071592607744 |
|---|---|
| author | Adrián Muñoz-Barrera, Luis A. Rubio-Rodríguez, David Jáspez, Almudena Corrales, Itahisa Marcelino-Rodriguez, Lourdes Ortiz, Pablo Mendoza, José M. Lorenzo-Salazar, Rafaela González-Montelongo, Carlos Flores |
| author_sort | Adrián Muñoz-Barrera |
| collection | DOAJ |
| description | Accurate and complete de novo genome assemblies enable variant identification and the discovery of novel genomic features and biological functions. However, de novo assemblies of large and complex genomes remain challenging. Long-read sequencing data, alone or combined with short-read data, facilitate genome assembly. However, the literature has limited comprehensive evaluations of software performance, especially for human genome assembly. We benchmarked 11 pipelines, including four long-read only assemblers and three hybrid assemblers, combined with four polishing schemes, using the HG002 human reference material sequenced with Oxford Nanopore Technologies and Illumina. The best-performing pipeline was validated with non-reference human and non-human routine laboratory samples. Software performance was assessed using QUAST, BUSCO, and Merqury metrics, alongside computational cost analyses. We found that Flye outperformed all assemblers, particularly with Ratatosk error-corrected long-reads. Polishing improved the assembly accuracy and continuity, with two rounds of Racon and Pilon yielding the best results. The assembly of data from validation samples showed comparable assembly metrics to those of the reference material. Based on the results, a complete optimal analysis pipeline for the assembly, polishing, and contig curation developed on Nextflow is provided to enable efficient parallelization and built-in dependency management to further advance the generation of high-quality and chromosome-level assemblies. |
| format | Article |
| id | doaj-art-5fded13f85594088a5b0bb9dec15207e |
| institution | DOAJ |
| issn | 2001-0370 |
| language | English |
| publishDate | 2025-01-01 |
| publisher | Elsevier |
| record_format | Article |
| series | Computational and Structural Biotechnology Journal |
| spelling | Record: doaj-art-5fded13f85594088a5b0bb9dec15207e; 2025-08-20T03:12:01Z; eng; Elsevier; Computational and Structural Biotechnology Journal; ISSN 2001-0370; 2025-01-01; volume 27, pages 3099-3109; DOI 10.1016/j.csbj.2025.07.020. Title: Benchmarking of bioinformatics tools for the hybrid de novo assembly of human and non-human whole-genome sequencing data. Authors: Adrián Muñoz-Barrera [0], Luis A. Rubio-Rodríguez [1], David Jáspez [2], Almudena Corrales [3], Itahisa Marcelino-Rodriguez [4], Lourdes Ortiz [5], Pablo Mendoza [6], José M. Lorenzo-Salazar [7], Rafaela González-Montelongo [8], Carlos Flores [9]. Affiliations: [0] Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain. [1] Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain. [2] Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain. [3] Research Unit, Hospital Universitario Nuestra Señora de Candelaria, Instituto de Investigación Sanitaria de Canarias, Santa Cruz de Tenerife, Spain; CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain. [4] Preventive Medicine and Public Health Area, Universidad de La Laguna, San Cristóbal de La Laguna, Spain; Institute of Biomedical Technologies, Universidad de La Laguna, San Cristóbal de La Laguna, Spain. [5] Department of Research and Development in Molecular Diagnostic, Vircell S.L., Granada, Spain. [6] Department of Research and Development in Molecular Diagnostic, Vircell S.L., Granada, Spain. [7] Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain. [8] Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain; Plataforma Genómica de Alto Rendimiento para el Estudio de la Biodiversidad, Associated Unit to Consejo Superior de Investigaciones Científicas (CSIC) by Instituto de Productos Naturales y Agrobiología (IPNA), San Cristóbal de La Laguna, Spain. [9] Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain; Research Unit, Hospital Universitario Nuestra Señora de Candelaria, Instituto de Investigación Sanitaria de Canarias, Santa Cruz de Tenerife, Spain; CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain; Plataforma Genómica de Alto Rendimiento para el Estudio de la Biodiversidad, Associated Unit to Consejo Superior de Investigaciones Científicas (CSIC) by Instituto de Productos Naturales y Agrobiología (IPNA), San Cristóbal de La Laguna, Spain; Facultad de Ciencias de la Salud, Universidad Fernando de Pessoa Canarias, Las Palmas de Gran Canaria, Spain; Corresponding author at: Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain. Abstract: Accurate and complete de novo genome assemblies enable variant identification and the discovery of novel genomic features and biological functions. However, de novo assemblies of large and complex genomes remain challenging. Long-read sequencing data, alone or combined with short-read data, facilitate genome assembly. However, the literature has limited comprehensive evaluations of software performance, especially for human genome assembly. We benchmarked 11 pipelines, including four long-read only assemblers and three hybrid assemblers, combined with four polishing schemes, using the HG002 human reference material sequenced with Oxford Nanopore Technologies and Illumina. The best-performing pipeline was validated with non-reference human and non-human routine laboratory samples. Software performance was assessed using QUAST, BUSCO, and Merqury metrics, alongside computational cost analyses. We found that Flye outperformed all assemblers, particularly with Ratatosk error-corrected long-reads. Polishing improved the assembly accuracy and continuity, with two rounds of Racon and Pilon yielding the best results. The assembly of data from validation samples showed comparable assembly metrics to those of the reference material. Based on the results, a complete optimal analysis pipeline for the assembly, polishing, and contig curation developed on Nextflow is provided to enable efficient parallelization and built-in dependency management to further advance the generation of high-quality and chromosome-level assemblies. URL: http://www.sciencedirect.com/science/article/pii/S2001037025002867. Keywords: Long-read sequencing; Nanopore; WGS; De novo genome assembly |
| title | Benchmarking of bioinformatics tools for the hybrid de novo assembly of human and non-human whole-genome sequencing data |
| topic | Long-read sequencing; Nanopore; WGS; De novo genome assembly |
| url | http://www.sciencedirect.com/science/article/pii/S2001037025002867 |