Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data
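Abstract: As next-generation sequencing technologies produce deeper genome coverages at lower costs, there is a critical need for reliable computational host DNA removal in metagenomic data. We find that insufficient host filtration using prior human genome references can introduce false sex biases and inadvertently permit flow-through of host-specific DNA during bioinformatic analyses, which could be exploited for individual identification. To address these issues, we introduce and benchmark three host filtration methods of varying throughput, with concomitant applications across low biomass samples such as skin and high microbial biomass datasets including fecal samples. We find that these methods are important for obtaining accurate results in low biomass samples (e.g., tissue, skin). Overall, we demonstrate that rigorous host filtration is a key component of privacy-minded analyses of patient microbiomes and provide computationally efficient pipelines for accomplishing this task on large-scale datasets.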
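Main Authors: | Caitlin Guccione, Lucas Patel, Yoshihiko Tomofuji, Daniel McDonald, Antonio Gonzalez, Gregory D. Sepich-Poore, Kyuto Sonehara, Mohsen Zakeri, Yang Chen, Amanda Hazel Dilmore, Neil Damle, Sergio E. Baranzini, George Hightower, Teruaki Nakatsuji, Richard L. Gallo, Ben Langmead, Yukinori Okada, Kit Curtius, Rob Knight |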
---|---|
Format: | Article |
Language: | English |
Published: | Nature Portfolio, 2025-01-01 |
Series: | Nature Communications |
Online Access: | https://doi.org/10.1038/s41467-025-56077-5 |
_version_ | 1832594578895536128 |
---|---|
author | Caitlin Guccione; Lucas Patel; Yoshihiko Tomofuji; Daniel McDonald; Antonio Gonzalez; Gregory D. Sepich-Poore; Kyuto Sonehara; Mohsen Zakeri; Yang Chen; Amanda Hazel Dilmore; Neil Damle; Sergio E. Baranzini; George Hightower; Teruaki Nakatsuji; Richard L. Gallo; Ben Langmead; Yukinori Okada; Kit Curtius; Rob Knight |
author_facet | Caitlin Guccione; Lucas Patel; Yoshihiko Tomofuji; Daniel McDonald; Antonio Gonzalez; Gregory D. Sepich-Poore; Kyuto Sonehara; Mohsen Zakeri; Yang Chen; Amanda Hazel Dilmore; Neil Damle; Sergio E. Baranzini; George Hightower; Teruaki Nakatsuji; Richard L. Gallo; Ben Langmead; Yukinori Okada; Kit Curtius; Rob Knight |
author_sort | Caitlin Guccione |
collection | DOAJ |
description | Abstract As next-generation sequencing technologies produce deeper genome coverages at lower costs, there is a critical need for reliable computational host DNA removal in metagenomic data. We find that insufficient host filtration using prior human genome references can introduce false sex biases and inadvertently permit flow-through of host-specific DNA during bioinformatic analyses, which could be exploited for individual identification. To address these issues, we introduce and benchmark three host filtration methods of varying throughput, with concomitant applications across low biomass samples such as skin and high microbial biomass datasets including fecal samples. We find that these methods are important for obtaining accurate results in low biomass samples (e.g., tissue, skin). Overall, we demonstrate that rigorous host filtration is a key component of privacy-minded analyses of patient microbiomes and provide computationally efficient pipelines for accomplishing this task on large-scale datasets. |
format | Article |
id | doaj-art-d1c6c644f8c845f18b85029b4687b675 |
institution | Kabale University |
issn | 2041-1723 |
language | English |
publishDate | 2025-01-01 |
publisher | Nature Portfolio |
record_format | Article |
series | Nature Communications |
spelling | doaj-art-d1c6c644f8c845f18b85029b4687b675; 2025-01-19T12:30:49Z; eng; Nature Portfolio; Nature Communications; 2041-1723; 2025-01-01; 16; 1; 1; 14; 10.1038/s41467-025-56077-5; Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data; Authors: Caitlin Guccione [0]; Lucas Patel [1]; Yoshihiko Tomofuji [2]; Daniel McDonald [3]; Antonio Gonzalez [4]; Gregory D. Sepich-Poore [5]; Kyuto Sonehara [6]; Mohsen Zakeri [7]; Yang Chen [8]; Amanda Hazel Dilmore [9]; Neil Damle [10]; Sergio E. Baranzini [11]; George Hightower [12]; Teruaki Nakatsuji [13]; Richard L. Gallo [14]; Ben Langmead [15]; Yukinori Okada [16]; Kit Curtius [17]; Rob Knight [18]; Affiliations: [0] Division of Biomedical Informatics, Department of Medicine, University of California San Diego; [1] Bioinformatics and Systems Biology Program, University of California San Diego; [2] Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo; [3] Department of Pediatrics, University of California San Diego; [4] Department of Pediatrics, University of California San Diego; [5] Shu Chien-Gene Lay Department of Bioengineering, University of California San Diego; [6] Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo; [7] Department of Computer Science, Johns Hopkins University; [8] Department of Pediatrics, University of California San Diego; [9] Department of Pediatrics, University of California San Diego; [10] Halıcıoğlu Data Science Institute, University of California San Diego; [11] Weill Institute for Neurosciences, Department of Neurology, University of California, San Francisco (UCSF); [12] Department of Dermatology, University of California San Diego; [13] Department of Dermatology, University of California San Diego; [14] Department of Dermatology, University of California San Diego; [15] Department of Computer Science, Johns Hopkins University; [16] Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo; [17] Division of Biomedical Informatics, Department of Medicine, University of California San Diego; [18] Department of Pediatrics, University of California San Diego; Abstract: As next-generation sequencing technologies produce deeper genome coverages at lower costs, there is a critical need for reliable computational host DNA removal in metagenomic data. We find that insufficient host filtration using prior human genome references can introduce false sex biases and inadvertently permit flow-through of host-specific DNA during bioinformatic analyses, which could be exploited for individual identification. To address these issues, we introduce and benchmark three host filtration methods of varying throughput, with concomitant applications across low biomass samples such as skin and high microbial biomass datasets including fecal samples. We find that these methods are important for obtaining accurate results in low biomass samples (e.g., tissue, skin). Overall, we demonstrate that rigorous host filtration is a key component of privacy-minded analyses of patient microbiomes and provide computationally efficient pipelines for accomplishing this task on large-scale datasets.; https://doi.org/10.1038/s41467-025-56077-5 |
title | Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data |
url | https://doi.org/10.1038/s41467-025-56077-5 |