Benchmarking of bioinformatics tools for the hybrid de novo assembly of human and non-human whole-genome sequencing data

Accurate and complete de novo genome assemblies enable variant identification and the discovery of novel genomic features and biological functions. However, de novo assemblies of large and complex genomes remain challenging. Long-read sequencing data, alone or combined with short-read data, facilita...

Bibliographic Details
Main Authors: Adrián Muñoz-Barrera, Luis A. Rubio-Rodríguez, David Jáspez, Almudena Corrales, Itahisa Marcelino-Rodriguez, Lourdes Ortiz, Pablo Mendoza, José M. Lorenzo-Salazar, Rafaela González-Montelongo, Carlos Flores
Format: Article
Language: English
Published: Elsevier 2025-01-01
Series: Computational and Structural Biotechnology Journal
Subjects: Long-read sequencing; Nanopore; WGS; De novo genome assembly
Online Access: http://www.sciencedirect.com/science/article/pii/S2001037025002867
_version_ 1849720071592607744
author Adrián Muñoz-Barrera
Luis A. Rubio-Rodríguez
David Jáspez
Almudena Corrales
Itahisa Marcelino-Rodriguez
Lourdes Ortiz
Pablo Mendoza
José M. Lorenzo-Salazar
Rafaela González-Montelongo
Carlos Flores
author_sort Adrián Muñoz-Barrera
collection DOAJ
description Accurate and complete de novo genome assemblies enable variant identification and the discovery of novel genomic features and biological functions. However, de novo assemblies of large and complex genomes remain challenging. Long-read sequencing data, alone or combined with short-read data, facilitate genome assembly. However, the literature has limited comprehensive evaluations of software performance, especially for human genome assembly. We benchmarked 11 pipelines, including four long-read only assemblers and three hybrid assemblers, combined with four polishing schemes, using the HG002 human reference material sequenced with Oxford Nanopore Technologies and Illumina. The best-performing pipeline was validated with non-reference human and non-human routine laboratory samples. Software performance was assessed using QUAST, BUSCO, and Merqury metrics, alongside computational cost analyses. We found that Flye outperformed all assemblers, particularly with Ratatosk error-corrected long-reads. Polishing improved the assembly accuracy and continuity, with two rounds of Racon and Pilon yielding the best results. The assembly of data from validation samples showed comparable assembly metrics to those of the reference material. Based on the results, a complete optimal analysis pipeline for the assembly, polishing, and contig curation developed on Nextflow is provided to enable efficient parallelization and built-in dependency management to further advance the generation of high-quality and chromosome-level assemblies.
format Article
id doaj-art-5fded13f85594088a5b0bb9dec15207e
institution DOAJ
issn 2001-0370
language English
publishDate 2025-01-01
publisher Elsevier
record_format Article
series Computational and Structural Biotechnology Journal
spelling doaj-art-5fded13f85594088a5b0bb9dec15207e
2025-08-20T03:12:01Z
eng
Elsevier
Computational and Structural Biotechnology Journal
2001-0370
2025-01-01
27
3099
3109
10.1016/j.csbj.2025.07.020
Benchmarking of bioinformatics tools for the hybrid de novo assembly of human and non-human whole-genome sequencing data
Adrián Muñoz-Barrera [0]
Luis A. Rubio-Rodríguez [1]
David Jáspez [2]
Almudena Corrales [3]
Itahisa Marcelino-Rodriguez [4]
Lourdes Ortiz [5]
Pablo Mendoza [6]
José M. Lorenzo-Salazar [7]
Rafaela González-Montelongo [8]
Carlos Flores [9]
[0] Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
[1] Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
[2] Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
[3] Research Unit, Hospital Universitario Nuestra Señora de Candelaria, Instituto de Investigación Sanitaria de Canarias, Santa Cruz de Tenerife, Spain; CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain
[4] Preventive Medicine and Public Health Area, Universidad de La Laguna, San Cristóbal de La Laguna, Spain; Institute of Biomedical Technologies, Universidad de La Laguna, San Cristóbal de La Laguna, Spain
[5] Department of Research and Development in Molecular Diagnostic, Vircell S.L., Granada, Spain
[6] Department of Research and Development in Molecular Diagnostic, Vircell S.L., Granada, Spain
[7] Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain
[8] Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain; Plataforma Genómica de Alto Rendimiento para el Estudio de la Biodiversidad, Associated Unit to Consejo Superior de Investigaciones Científicas (CSIC) by Instituto de Productos Naturales y Agrobiología (IPNA), San Cristóbal de La Laguna, Spain
[9] Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain; Research Unit, Hospital Universitario Nuestra Señora de Candelaria, Instituto de Investigación Sanitaria de Canarias, Santa Cruz de Tenerife, Spain; CIBER de Enfermedades Respiratorias, Instituto de Salud Carlos III, Madrid, Spain; Plataforma Genómica de Alto Rendimiento para el Estudio de la Biodiversidad, Associated Unit to Consejo Superior de Investigaciones Científicas (CSIC) by Instituto de Productos Naturales y Agrobiología (IPNA), San Cristóbal de La Laguna, Spain; Facultad de Ciencias de la Salud, Universidad Fernando de Pessoa Canarias, Las Palmas de Gran Canaria, Spain; Corresponding author at: Genomics Division, Instituto Tecnológico y de Energías Renovables (ITER), Santa Cruz de Tenerife, Spain.
Accurate and complete de novo genome assemblies enable variant identification and the discovery of novel genomic features and biological functions. However, de novo assemblies of large and complex genomes remain challenging. Long-read sequencing data, alone or combined with short-read data, facilitate genome assembly. However, the literature has limited comprehensive evaluations of software performance, especially for human genome assembly. We benchmarked 11 pipelines, including four long-read only assemblers and three hybrid assemblers, combined with four polishing schemes, using the HG002 human reference material sequenced with Oxford Nanopore Technologies and Illumina. The best-performing pipeline was validated with non-reference human and non-human routine laboratory samples. Software performance was assessed using QUAST, BUSCO, and Merqury metrics, alongside computational cost analyses. We found that Flye outperformed all assemblers, particularly with Ratatosk error-corrected long-reads. Polishing improved the assembly accuracy and continuity, with two rounds of Racon and Pilon yielding the best results. The assembly of data from validation samples showed comparable assembly metrics to those of the reference material. Based on the results, a complete optimal analysis pipeline for the assembly, polishing, and contig curation developed on Nextflow is provided to enable efficient parallelization and built-in dependency management to further advance the generation of high-quality and chromosome-level assemblies.
http://www.sciencedirect.com/science/article/pii/S2001037025002867
Long-read sequencing
Nanopore
WGS
De novo genome assembly
title Benchmarking of bioinformatics tools for the hybrid de novo assembly of human and non-human whole-genome sequencing data
title_sort benchmarking of bioinformatics tools for the hybrid de novo assembly of human and non human whole genome sequencing data
topic Long-read sequencing
Nanopore
WGS
De novo genome assembly
url http://www.sciencedirect.com/science/article/pii/S2001037025002867