Single-cell data combined with phenotypes improves variant interpretation

Abstract Background Whole genome sequencing offers significant potential to improve the diagnosis and treatment of rare diseases by enabling the identification of thousands of rare, potentially pathogenic variants. Existing variant prioritisation tools can be complemented by approaches that incorpor...

Bibliographic Details
Main Authors: Timothy Chapman, Timo Lassmann
Format: Article
Language: English
Published: BMC 2025-05-01
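Series: BMC Genomics
Subjects: Rare disease; Variant prioritisation; Machine learning; Random forest; Interpretable models; Whole Genome sequencing
Online Access: https://doi.org/10.1186/s12864-025-11711-w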
_version_ 1849705102717222912
author Timothy Chapman
Timo Lassmann
author_facet Timothy Chapman
Timo Lassmann
author_sort Timothy Chapman
collection DOAJ
description Abstract. Background: Whole genome sequencing offers significant potential to improve the diagnosis and treatment of rare diseases by enabling the identification of thousands of rare, potentially pathogenic variants. Existing variant prioritisation tools can be complemented by approaches that incorporate phenotype specificity and provide contextual biological information, such as tissue or cell-type specificity. We hypothesised that integrating single-cell gene expression data into phenotype-specific models would improve the accuracy and interpretability of pathogenic variant prioritisation. Methods: To test this hypothesis, we developed IMPPROVE, a new tool that constructs phenotype-specific ensemble models integrating CADD scores with bulk and single-cell gene expression data. We constructed a total of 1,866 Random Forest models for individual HPO terms, incorporating both bulk and single-cell expression data. Results: Our phenotype-specific models utilising expression data can better predict pathogenic variants in 90% of the phenotypes (HPO terms) considered. Using single-cell expression data instead of bulk benefited the models, significantly shifting the proportion of pathogenic variants that were correctly identified at a fixed false positive rate ($p < 10^{-30}$, using an approximate Wilcoxon signed rank test). We found that the models for 57 phenotypes exhibited a large performance difference depending on the dataset used. Further analysis revealed biological links between the pathology and the tissues or cell types used by these 57 models. Conclusions: Phenotype-specific models that integrate gene expression data with CADD scores show great promise in improving variant prioritisation. In addition to improving diagnostic accuracy, these models offer insights into the underlying biological mechanisms of rare diseases. Enriching existing pathogenicity-related scores with gene expression datasets has the potential to advance personalised medicine through more accurate and interpretable variant prioritisation.
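The Results in the abstract rest on two quantities: the proportion of pathogenic variants correctly identified at a fixed false positive rate, and an approximate Wilcoxon signed rank test comparing paired models. The following is a minimal sketch of both, assuming binary labels (1 = pathogenic, 0 = benign) and one score per variant. The function names `tpr_at_fpr` and `wilcoxon_signed_rank_approx` are hypothetical, and the handling of ties and zero differences is an assumption, since the record does not say how IMPPROVE treats them. It is not the authors' implementation.

```python
import math


def tpr_at_fpr(scores, labels, max_fpr):
    """Fraction of pathogenic variants (label 1) scored above the strictest
    threshold that keeps the false positive rate at or below max_fpr."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = sorted((s for s, y in zip(scores, labels) if y == 0), reverse=True)
    if not pos or not neg:
        raise ValueError("need at least one pathogenic and one benign variant")
    allowed_fp = math.floor(max_fpr * len(neg))
    # Calling a variant positive only when score > threshold guarantees that
    # at most allowed_fp benign variants are called, even with tied scores.
    threshold = neg[allowed_fp] if allowed_fp < len(neg) else -math.inf
    return sum(s > threshold for s in pos) / len(pos)


def wilcoxon_signed_rank_approx(x, y):
    """Return (W+, two-sided p-value) for paired samples using the normal
    approximation. Assumptions: zero differences are dropped, tied absolute
    differences get average ranks, no continuity correction is applied."""
    diffs = [a - b for a, b in zip(x, y) if a != b]
    n = len(diffs)
    if n == 0:
        raise ValueError("all paired differences are zero")
    order = sorted(range(n), key=lambda i: abs(diffs[i]))
    ranks = [0.0] * n
    tie_term = 0.0
    i = 0
    while i < n:
        j = i
        while j + 1 < n and abs(diffs[order[j + 1]]) == abs(diffs[order[i]]):
            j += 1
        avg_rank = (i + j) / 2 + 1  # mean of the 1-based ranks i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = avg_rank
        t = j - i + 1
        tie_term += t ** 3 - t
        i = j + 1
    w_plus = sum(r for r, d in zip(ranks, diffs) if d > 0)
    mean = n * (n + 1) / 4
    var = n * (n + 1) * (2 * n + 1) / 24 - tie_term / 48
    z = (w_plus - mean) / math.sqrt(var)
    return w_plus, math.erfc(abs(z) / math.sqrt(2))
```

In use, one would compute `tpr_at_fpr` once per HPO term for the single-cell model and once for the bulk model, then pass the two paired lists to `wilcoxon_signed_rank_approx`. The false positive rate used in the paper is not stated in this record, so any value passed as `max_fpr` is the caller's choice.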
format Article
id doaj-art-4b23f1ec699c4b308491ebb0d82ecd2c
institution DOAJ
issn 1471-2164
language English
publishDate 2025-05-01
publisher BMC
record_format Article
series BMC Genomics
spelling doaj-art-4b23f1ec699c4b308491ebb0d82ecd2c; 2025-08-20T03:16:33Z; eng; BMC; BMC Genomics; 1471-2164; 2025-05-01; 26; 1; 1; 16; 10.1186/s12864-025-11711-w; Single-cell data combined with phenotypes improves variant interpretation; Timothy Chapman (The Kids Research Institute Australia); Timo Lassmann (The Kids Research Institute Australia); https://doi.org/10.1186/s12864-025-11711-w; Rare disease; Variant prioritisation; Machine learning; Random forest; Interpretable models; Whole Genome sequencing
title Single-cell data combined with phenotypes improves variant interpretation
title_sort single cell data combined with phenotypes improves variant interpretation
topic Rare disease
Variant prioritisation
Machine learning
Random forest
Interpretable models
Whole Genome sequencing
url https://doi.org/10.1186/s12864-025-11711-w