Comprehensive Analysis of the Influence of Technical and Biological Variations on De Novo Assembly of RNA-Seq Datasets

De novo assembly of transcriptomes from species without a reference genome remains a common problem in functional genomics. While methods and algorithms for transcriptome assembly are continually being developed and published, the quality of de novo assemblies using short reads depends on the complexi...

Bibliographic Details
Main Authors:	Gonzalez Sergio Alberto, Rivarola Maximo, Ribone Andres, Lew Sergio, Paniego Norma
Format:	Article
Language:	English
Published:	SAGE Publishing 2024-12-01
Series:	Bioinformatics and Biology Insights
Online Access:	https://doi.org/10.1177/11779322241274957

author	Gonzalez Sergio Alberto; Rivarola Maximo; Ribone Andres; Lew Sergio; Paniego Norma
author_sort	Gonzalez Sergio Alberto
collection	DOAJ
description	De novo assembly of transcriptomes from species without a reference genome remains a common problem in functional genomics. While methods and algorithms for transcriptome assembly are continually being developed and published, the quality of de novo assemblies using short reads depends on the complexity of the transcriptome and is limited by several types of errors. One problem to overcome is the research gap regarding the best method to use in each study to obtain a high-quality de novo assembly. Currently, there are no established protocols for solving the assembly problem that take transcriptome complexity into account. In addition, the accuracy of the quality metrics used to evaluate assemblies remains unclear. In this study, we investigate and discuss how different variables accounting for the complexity of RNA-Seq data influence assembly results independently of the software used. For this purpose, we simulated transcriptomic short-read sequence datasets from high-quality full-length predicted transcript models with varying degrees of complexity. Subsequently, we conducted de novo assemblies using different assembly programs, and compared and classified the results using both reference-dependent and reference-independent metrics. These metrics were assessed both individually and combined through multivariate analysis. The degree of alternative splicing and the fragment size of the paired-end reads were identified as the variables with the greatest influence on the assembly results. Moreover, read length and fragment size had different influences on the reconstruction of longer and shorter transcripts. These results underscore the importance of understanding the composition of the transcriptome under study, and of making experimental design decisions related to the need to work with reads and fragments of different sizes. In addition, the choice of assembly software will positively impact the final assembly outcome. This selection will affect the completeness of represented genes and assembled isoforms, as well as contribute to error reduction.
format	Article
id	doaj-art-23dd421cfef6480fbace02d4a2ac6ff8
institution	OA Journals
issn	1177-9322
language	English
publishDate	2024-12-01
publisher	SAGE Publishing
record_format	Article
series	Bioinformatics and Biology Insights
spelling	doaj-art-23dd421cfef6480fbace02d4a2ac6ff8 | 2025-08-20T02:31:23Z | eng | SAGE Publishing | Bioinformatics and Biology Insights | 1177-9322 | 2024-12-01 | 18 | 10.1177/11779322241274957 | Comprehensive Analysis of the Influence of Technical and Biological Variations on De Novo Assembly of RNA-Seq Datasets | Gonzalez Sergio Alberto (0), Rivarola Maximo (1), Ribone Andres (2), Lew Sergio (3), Paniego Norma (4) | Instituto de Agrobiotecnología y Biología Molecular (IABIMO), CICVyA, Instituto Nacional de Tecnología Agropecuaria (INTA), Buenos Aires, Argentina | Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina | Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina | Instituto de Ingeniería Biomédica, Facultad de Ingeniería, Universidad de Buenos Aires, Buenos Aires, Argentina | Consejo Nacional de Investigaciones Científicas y Técnicas (CONICET), Buenos Aires, Argentina | https://doi.org/10.1177/11779322241274957
title	Comprehensive Analysis of the Influence of Technical and Biological Variations on De Novo Assembly of RNA-Seq Datasets
url	https://doi.org/10.1177/11779322241274957

Similar Items