Inference Attacks on Genomic Data Based on Probabilistic Graphical Models

The rapid progress and plummeting costs of human-genome sequencing have made large amounts of personal biomedical information available, raising one of the most important concerns: genomic data privacy. Since personal biomedical data are highly correlated among relatives, with the increasin...

Bibliographic Details
Main Authors:	Zaobo He, Junxiu Zhou
Format:	Article
Language:	English
Published:	Tsinghua University Press 2020-09-01
Series:	Big Data Mining and Analytics
Subjects:	single nucleotide polymorphism (SNP)-trait association; belief propagation; factor graph; data sanitization
Online Access:	https://www.sciopen.com/article/10.26599/BDMA.2020.9020008

collection	DOAJ
description	The rapid progress and plummeting costs of human-genome sequencing have made large amounts of personal biomedical information available, raising one of the most important concerns: genomic data privacy. Since personal biomedical data are highly correlated among relatives, the increasing availability of genomes and personal traits online (i.e., leaked unwittingly, or released intentionally to genetic service platforms) threatens kin-genomic data privacy. We propose new inference attacks to predict unknown Single Nucleotide Polymorphisms (SNPs) and human traits of individuals in a familial genomic dataset based on probabilistic graphical models and belief propagation. With this method, the adversary can predict the unobserved genomes or traits of targeted individuals in a family genomic dataset where some individuals’ genomes and traits are observed, relying on SNP-trait associations from Genome-Wide Association Studies (GWAS), Mendel’s Laws, and statistical relations between SNPs. Existing genome inference methods have relatively high computational complexity when given tens of millions of SNPs and human traits as input. We then propose an approach to publish genomic data with a differential privacy guarantee. After an approximate distribution of the input genomic dataset is found using Bayesian networks, noise is injected into it to obtain a noisy distribution. Finally, a synthetic genomic dataset is sampled, and it is proved that any query on the synthetic dataset satisfies the differential privacy guarantee.
id	doaj-art-a5c2f845396e4632a4cd6e40b7196a31
institution	Kabale University
issn	2096-0654
spelling	Zaobo He and Junxiu Zhou (both: Department of Computer Science and Software Engineering, Miami University, Oxford, OH 45011, USA). Inference Attacks on Genomic Data Based on Probabilistic Graphical Models. Big Data Mining and Analytics, vol. 3, no. 3, pp. 225-233. Tsinghua University Press, 2020-09-01. ISSN 2096-0654. DOI: 10.26599/BDMA.2020.9020008
