Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data

Abstract As next-generation sequencing technologies produce deeper genome coverages at lower costs, there is a critical need for reliable computational host DNA removal in metagenomic data. We find that insufficient host filtration using prior human genome references can introduce false sex biases a...

Bibliographic Details
Main Authors: Caitlin Guccione, Lucas Patel, Yoshihiko Tomofuji, Daniel McDonald, Antonio Gonzalez, Gregory D. Sepich-Poore, Kyuto Sonehara, Mohsen Zakeri, Yang Chen, Amanda Hazel Dilmore, Neil Damle, Sergio E. Baranzini, George Hightower, Teruaki Nakatsuji, Richard L. Gallo, Ben Langmead, Yukinori Okada, Kit Curtius, Rob Knight
Format: Article
Language: English
Published: Nature Portfolio 2025-01-01
Series: Nature Communications
Online Access: https://doi.org/10.1038/s41467-025-56077-5
_version_ 1832594578895536128
author Caitlin Guccione
Lucas Patel
Yoshihiko Tomofuji
Daniel McDonald
Antonio Gonzalez
Gregory D. Sepich-Poore
Kyuto Sonehara
Mohsen Zakeri
Yang Chen
Amanda Hazel Dilmore
Neil Damle
Sergio E. Baranzini
George Hightower
Teruaki Nakatsuji
Richard L. Gallo
Ben Langmead
Yukinori Okada
Kit Curtius
Rob Knight
author_facet Caitlin Guccione
Lucas Patel
Yoshihiko Tomofuji
Daniel McDonald
Antonio Gonzalez
Gregory D. Sepich-Poore
Kyuto Sonehara
Mohsen Zakeri
Yang Chen
Amanda Hazel Dilmore
Neil Damle
Sergio E. Baranzini
George Hightower
Teruaki Nakatsuji
Richard L. Gallo
Ben Langmead
Yukinori Okada
Kit Curtius
Rob Knight
author_sort Caitlin Guccione
collection DOAJ
description Abstract As next-generation sequencing technologies produce deeper genome coverages at lower costs, there is a critical need for reliable computational host DNA removal in metagenomic data. We find that insufficient host filtration using prior human genome references can introduce false sex biases and inadvertently permit flow-through of host-specific DNA during bioinformatic analyses, which could be exploited for individual identification. To address these issues, we introduce and benchmark three host filtration methods of varying throughput, with concomitant applications across low biomass samples such as skin and high microbial biomass datasets including fecal samples. We find that these methods are important for obtaining accurate results in low biomass samples (e.g., tissue, skin). Overall, we demonstrate that rigorous host filtration is a key component of privacy-minded analyses of patient microbiomes and provide computationally efficient pipelines for accomplishing this task on large-scale datasets.
format Article
id doaj-art-d1c6c644f8c845f18b85029b4687b675
institution Kabale University
issn 2041-1723
language English
publishDate 2025-01-01
publisher Nature Portfolio
record_format Article
series Nature Communications
spelling doaj-art-d1c6c644f8c845f18b85029b4687b675
2025-01-19T12:30:49Z
eng
Nature Portfolio
Nature Communications
2041-1723
2025-01-01
161114
10.1038/s41467-025-56077-5
Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data
Caitlin Guccione 0
Lucas Patel 1
Yoshihiko Tomofuji 2
Daniel McDonald 3
Antonio Gonzalez 4
Gregory D. Sepich-Poore 5
Kyuto Sonehara 6
Mohsen Zakeri 7
Yang Chen 8
Amanda Hazel Dilmore 9
Neil Damle 10
Sergio E. Baranzini 11
George Hightower 12
Teruaki Nakatsuji 13
Richard L. Gallo 14
Ben Langmead 15
Yukinori Okada 16
Kit Curtius 17
Rob Knight 18
Division of Biomedical Informatics, Department of Medicine, University of California San Diego
Bioinformatics and Systems Biology Program, University of California San Diego
Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo
Department of Pediatrics, University of California San Diego
Department of Pediatrics, University of California San Diego
Shu Chien-Gene Lay Department of Bioengineering, University of California San Diego
Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo
Department of Computer Science, Johns Hopkins University
Department of Pediatrics, University of California San Diego
Department of Pediatrics, University of California San Diego
Halıcıoğlu Data Science Institute, University of California San Diego
Weill Institute for Neurosciences. Department of Neurology. University of California, San Francisco (UCSF)
Department of Dermatology, University of California San Diego
Department of Dermatology, University of California San Diego
Department of Dermatology, University of California San Diego
Department of Computer Science, Johns Hopkins University
Department of Genome Informatics, Graduate School of Medicine, the University of Tokyo
Division of Biomedical Informatics, Department of Medicine, University of California San Diego
Department of Pediatrics, University of California San Diego
Abstract As next-generation sequencing technologies produce deeper genome coverages at lower costs, there is a critical need for reliable computational host DNA removal in metagenomic data. We find that insufficient host filtration using prior human genome references can introduce false sex biases and inadvertently permit flow-through of host-specific DNA during bioinformatic analyses, which could be exploited for individual identification. To address these issues, we introduce and benchmark three host filtration methods of varying throughput, with concomitant applications across low biomass samples such as skin and high microbial biomass datasets including fecal samples. We find that these methods are important for obtaining accurate results in low biomass samples (e.g., tissue, skin). Overall, we demonstrate that rigorous host filtration is a key component of privacy-minded analyses of patient microbiomes and provide computationally efficient pipelines for accomplishing this task on large-scale datasets.
https://doi.org/10.1038/s41467-025-56077-5
spellingShingle Caitlin Guccione
Lucas Patel
Yoshihiko Tomofuji
Daniel McDonald
Antonio Gonzalez
Gregory D. Sepich-Poore
Kyuto Sonehara
Mohsen Zakeri
Yang Chen
Amanda Hazel Dilmore
Neil Damle
Sergio E. Baranzini
George Hightower
Teruaki Nakatsuji
Richard L. Gallo
Ben Langmead
Yukinori Okada
Kit Curtius
Rob Knight
Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data
Nature Communications
title Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data
title_full Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data
title_fullStr Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data
title_full_unstemmed Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data
title_short Incomplete human reference genomes can drive false sex biases and expose patient-identifying information in metagenomic data
title_sort incomplete human reference genomes can drive false sex biases and expose patient identifying information in metagenomic data
url https://doi.org/10.1038/s41467-025-56077-5
work_keys_str_mv AT caitlinguccione incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT lucaspatel incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT yoshihikotomofuji incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT danielmcdonald incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT antoniogonzalez incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT gregorydsepichpoore incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT kyutosonehara incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT mohsenzakeri incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT yangchen incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT amandahazeldilmore incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT neildamle incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT sergioebaranzini incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT georgehightower incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT teruakinakatsuji incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT richardlgallo incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT benlangmead incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT yukinoriokada incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT kitcurtius incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata
AT robknight incompletehumanreferencegenomescandrivefalsesexbiasesandexposepatientidentifyinginformationinmetagenomicdata