A scalable distributed pipeline for reference-free variants calling

Abstract Background Precision medicine pipelines typically begin with variant calling to identify disease-related mutations for optimal treatment selection. Reference-free approaches assess variations in the genetic profiles of distinct individuals through the utilization of a De Bruijn graph. Howev...

Bibliographic Details
Main Authors:	Lorenzo Di Rocco, Umberto Ferraro Petrillo
Format:	Article
Language:	English
Published:	BMC 2025-06-01
Series:	BMC Genomics
Subjects:	Computational genomics; Variants calling; Distributed computing
Online Access:	https://doi.org/10.1186/s12864-025-11722-7

author	Lorenzo Di Rocco; Umberto Ferraro Petrillo
collection	DOAJ
description	Abstract. Background: Precision medicine pipelines typically begin with variant calling to identify disease-related mutations for optimal treatment selection. Reference-free approaches assess variations in the genetic profiles of distinct individuals through the utilization of a De Bruijn graph. However, the timely analysis of large-scale sequencing data may be beyond the capabilities of single workstations, requiring alternative computational approaches. Results: We introduce the first-known distributed pipeline for detecting isolated SNPs (Single Nucleotide Polymorphisms), by leveraging the computational resources of multiple machines in parallel. Our pipeline efficiently analyzes large datasets thanks to the usage of a distributed De Bruijn graph representation. Furthermore, we introduce a cluster-driven algorithm to partition the De Bruijn graph across multiple independent machines according to the inner structure of the sequences under analysis, thus further improving the scalability of our pipeline. Conclusions: The results of our experiments, conducted on real-world datasets, show the good performance of our pipeline in terms of efficiency, output quality and scalability. Moreover, the reported results also confirm that the adoption of a specialized partitioning algorithm for the distributed representation of the De Bruijn graph leads to a relevant performance speed-up compared to using standard partitioning techniques.
format	Article
id	doaj-art-ea45e51a3bc2417bb74a4a70d24d9061
institution	OA Journals
issn	1471-2164
language	English
publishDate	2025-06-01
publisher	BMC
record_format	Article
series	BMC Genomics
spelling	doaj-art-ea45e51a3bc2417bb74a4a70d24d9061; 2025-08-20T02:05:46Z; eng; BMC; BMC Genomics; 1471-2164; 2025-06-01; 26; S1; 1; 14; 10.1186/s12864-025-11722-7; A scalable distributed pipeline for reference-free variants calling; Lorenzo Di Rocco (Department of Statistical Sciences, Sapienza University of Rome); Umberto Ferraro Petrillo (Department of Statistical Sciences, Sapienza University of Rome); https://doi.org/10.1186/s12864-025-11722-7; Computational genomics; Variants calling; Distributed computing
title	A scalable distributed pipeline for reference-free variants calling
topic	Computational genomics; Variants calling; Distributed computing
url	https://doi.org/10.1186/s12864-025-11722-7
