A scalable distributed pipeline for reference-free variants calling

Bibliographic Details
Main Authors: Lorenzo Di Rocco, Umberto Ferraro Petrillo
Format: Article
Language: English
Published: BMC 2025-06-01
Series: BMC Genomics
Subjects: Computational genomics; Variants calling; Distributed computing
Online Access: https://doi.org/10.1186/s12864-025-11722-7
collection DOAJ
description Abstract
Background: Precision medicine pipelines typically begin with variant calling to identify disease-related mutations for optimal treatment selection. Reference-free approaches assess variations in the genetic profiles of distinct individuals through the utilization of a De Bruijn graph. However, the timely analysis of large-scale sequencing data may be beyond the capabilities of single workstations, requiring alternative computational approaches.
Results: We introduce the first-known distributed pipeline for detecting isolated SNPs (Single Nucleotide Polymorphisms), by leveraging the computational resources of multiple machines in parallel. Our pipeline efficiently analyzes large datasets thanks to the usage of a distributed De Bruijn graph representation. Furthermore, we introduce a cluster-driven algorithm to partition the De Bruijn graph across multiple independent machines according to the inner structure of the sequences under analysis, thus further improving the scalability of our pipeline.
Conclusions: The results of our experiments, conducted on real-world datasets, show the good performance of our pipeline in terms of efficiency, output quality and scalability. Moreover, the reported results also confirm that the adoption of a specialized partitioning algorithm for the distributed representation of the De Bruijn graph leads to a relevant performance speed-up compared to using standard partitioning techniques.
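As a rough illustration of the ideas named in the abstract, the sketch below builds a toy De Bruijn graph from two reads that differ at a single position, spreads its nodes over partitions with a plain hash, and lists the branching nodes where an isolated SNP would open a bubble. It is not the authors' pipeline: the function names, the toy reads, the choice of k, and the CRC32-based partitioner are all assumptions made for this example, and the hash partitioner stands in only for a generic "standard partitioning technique", not for the cluster-driven algorithm the record describes.

```python
from collections import defaultdict
import zlib


def kmers(seq, k):
    """Yield every k-mer of seq, left to right."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer node to the set of (k-1)-mer nodes that follow it."""
    graph = defaultdict(set)
    for read in reads:
        for kmer in kmers(read, k):
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def hash_partition(graph, n_parts):
    """Assign each node to a partition with a deterministic hash.

    Hypothetical generic baseline; it ignores the structure of the sequences.
    """
    parts = [dict() for _ in range(n_parts)]
    for node, successors in graph.items():
        parts[zlib.crc32(node.encode()) % n_parts][node] = successors
    return parts


def branching_nodes(graph):
    """Nodes with more than one successor: candidate openings of a bubble."""
    return sorted(node for node, succ in graph.items() if len(succ) > 1)


# Toy reads (assumed data) that differ only at the fifth base: A versus T.
reads = ["ACGTACCA", "ACGTTCCA"]
graph = build_de_bruijn(reads, k=4)
parts = hash_partition(graph, n_parts=2)
branches = branching_nodes(graph)
```

With these toy reads the two paths leave the shared node `CGT` and rejoin at `CCA`, which is the bubble shape that reference-free SNP detection looks for. A hash-based split may place the two arms of such a bubble on different machines, which hints at why a structure-aware partitioner could reduce communication.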
id doaj-art-ea45e51a3bc2417bb74a4a70d24d9061
institution OA Journals
issn 1471-2164
spelling Lorenzo Di Rocco and Umberto Ferraro Petrillo (both Department of Statistical Sciences, Sapienza University of Rome). A scalable distributed pipeline for reference-free variants calling. BMC Genomics (ISSN 1471-2164), vol. 26, issue S1, pp. 1-14, 2025-06-01. https://doi.org/10.1186/s12864-025-11722-7. Record doaj-art-ea45e51a3bc2417bb74a4a70d24d9061, timestamp 2025-08-20T02:05:46Z.
topic Computational genomics
Variants calling
Distributed computing