Optimizing UK biobank cloud-based research analysis platform to fine-map coronary artery disease loci in whole genome sequencing data

Abstract We conducted the first comprehensive association analysis of a coronary artery disease (CAD) cohort within the recently released UK Biobank (UKB) whole genome sequencing dataset. We employed the fine-mapping tool PolyFun and pinpointed rs10757274 as the most likely causal SNV within the 9p21.3 CA...

Bibliographic Details
Main Authors: Letitia M.F. Sng, Anubhav Kaphle, Mitchell J. O’Brien, Brendan Hosking, Roc Reguant, Johan Verjans, Yatish Jain, Natalie A. Twine, Denis C. Bauer
Format: Article
Language: English
Published: Nature Portfolio 2025-03-01
Series: Scientific Reports
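Subjects: Population-scale genetics; UK Biobank; DNAnexus; Cloud-computing; GWAS; Trusted research environments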
Online Access: https://doi.org/10.1038/s41598-025-95286-2
_version_ 1850208727407263744
author Letitia M.F. Sng
Anubhav Kaphle
Mitchell J. O’Brien
Brendan Hosking
Roc Reguant
Johan Verjans
Yatish Jain
Natalie A. Twine
Denis C. Bauer
author_sort Letitia M.F. Sng
collection DOAJ
description Abstract We conducted the first comprehensive association analysis of a coronary artery disease (CAD) cohort within the recently released UK Biobank (UKB) whole genome sequencing dataset. We employed the fine-mapping tool PolyFun and pinpointed rs10757274 as the most likely causal SNV within the 9p21.3 CAD risk locus. Notably, we show that machine-learning (ML) approaches, REGENIE and VariantSpark, exhibited greater sensitivity than traditional single-SNV logistic regression, uncovering rs28451064, a known risk locus in 21q22.11. Our findings underscore the utility of advanced computational techniques and cloud-based resources for mega-biobank analyses. Aligning with the paradigm shift of bringing compute to data, we demonstrate a 44% cost reduction and 94% speedup through compute architecture optimisation on UK Biobank’s Research Analysis Platform using our RAPpoet approach. We discuss three considerations for researchers implementing novel workflows for datasets hosted on cloud platforms, to pave the way for harnessing mega-biobank-sized data through scalable, cost-effective cloud computing solutions.
format Article
id doaj-art-ca75fc9fbf1048d9818e8e489ad35b50
institution OA Journals
issn 2045-2322
language English
publishDate 2025-03-01
publisher Nature Portfolio
record_format Article
series Scientific Reports
spelling doaj-art-ca75fc9fbf1048d9818e8e489ad35b50
2025-08-20T02:10:10Z
eng
Nature Portfolio
Scientific Reports
2045-2322
2025-03-01
15119
10.1038/s41598-025-95286-2
Optimizing UK biobank cloud-based research analysis platform to fine-map coronary artery disease loci in whole genome sequencing data
Letitia M.F. Sng: Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO)
Anubhav Kaphle: Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO)
Mitchell J. O’Brien: Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO)
Brendan Hosking: Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO)
Roc Reguant: Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO)
Johan Verjans: Australian Institute for Machine Learning, University of Adelaide
Yatish Jain: Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO)
Natalie A. Twine: Australian e-Health Research Centre, Commonwealth Scientific and Industrial Research Organisation (CSIRO)
Denis C. Bauer: Applied Biosciences, Faculty of Science and Engineering, Macquarie University
https://doi.org/10.1038/s41598-025-95286-2
Population-scale genetics
UK Biobank
DNAnexus
Cloud-computing
GWAS
Trusted research environments
title Optimizing UK biobank cloud-based research analysis platform to fine-map coronary artery disease loci in whole genome sequencing data
topic Population-scale genetics
UK Biobank
DNAnexus
Cloud-computing
GWAS
Trusted research environments
url https://doi.org/10.1038/s41598-025-95286-2