Is Anonymization Through Discretization Reliable? Modeling Latent Probability Distributions for Ordinal Data as a Solution to the Small Sample Size Problem

Bibliographic Details
Main Authors: Stefan Michael Stroka, Christian Heumann
Format: Article
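Language: English
Published: MDPI AG 2024-10-01
Series: Stats
Subjects: re-identification; modeling latent class distribution; ordinal class; Bayesian inference; uncertainty quantification; supervised learning regression enhancement
Online Access: https://www.mdpi.com/2571-905X/7/4/70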
collection DOAJ
description The growing interest in data privacy and anonymization presents challenges, as traditional methods such as ordinal discretization often result in information loss by coarsening metric data. Current research suggests that modeling the latent distributions of ordinal classes can reduce the effectiveness of anonymization and increase traceability. In fact, combining probability distributions with a small training sample can effectively infer true metric values from discrete information, depending on the model and data complexity. Our method uses metric values and ordinal classes to model latent normal distributions for each discrete class. This approach, applied with both linear and Bayesian linear regression, aims to enhance supervised learning models. Evaluated with synthetic datasets and real-world datasets from UCI and Kaggle, our method shows improved mean point estimation and narrower prediction intervals compared to the baseline. With 5–10% training data randomly split from each dataset population, it achieves an average 10% reduction in MSE and a ~5–10% increase in R² on out-of-sample test data overall.
format Article
id doaj-art-99ee90f3a6f14be78a801be1353ebec2
institution DOAJ
issn 2571-905X
spelling doaj-art-99ee90f3a6f14be78a801be1353ebec2; 2025-08-20T02:56:58Z; eng; MDPI AG; Stats; ISSN 2571-905X; 2024-10-01; vol. 7, no. 4, pp. 1189–1208; DOI 10.3390/stats7040070
Stefan Michael Stroka (Department of Statistics, Ludwig-Maximilians-University Munich, 80539 Munich, Germany)
Christian Heumann (Department of Statistics, Ludwig-Maximilians-University Munich, 80539 Munich, Germany)
title Is Anonymization Through Discretization Reliable? Modeling Latent Probability Distributions for Ordinal Data as a Solution to the Small Sample Size Problem
topic re-identification
modeling latent class distribution
ordinal class
Bayesian inference
uncertainty quantification
supervised learning regression enhancement
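The description above says that a small training sample with known metric values can be used to model a latent normal distribution for each ordinal class, and that this helps infer metric values from discretized data. The following is a minimal sketch of that general idea only, not the authors' implementation: the function names, the bin edges, the assumed value range of 0 to 100, the synthetic data, and the class-midpoint baseline are all assumptions made for illustration, and the regression step mentioned in the description is left out.

```python
import random
import statistics


def discretize(value, edges):
    """Map a metric value to an ordinal class index (0 .. len(edges))."""
    return sum(value >= e for e in edges)


def fit_class_normals(values, labels):
    """Estimate a (mean, std) pair per ordinal class from a labelled sample."""
    grouped = {}
    for v, c in zip(values, labels):
        grouped.setdefault(c, []).append(v)
    params = {}
    for c, vs in grouped.items():
        mu = statistics.fmean(vs)
        # A single observation carries no spread information; use 0.0.
        sigma = statistics.stdev(vs) if len(vs) > 1 else 0.0
        params[c] = (mu, sigma)
    return params


def mse(truth, estimate):
    """Mean squared error between two equally long sequences."""
    return statistics.fmean((t - e) ** 2 for t, e in zip(truth, estimate))


def run_demo(seed=0, n=2000, train_share=0.05):
    """Compare class-mean estimates with a midpoint baseline on synthetic data."""
    rng = random.Random(seed)
    edges = [30, 50, 70]              # hypothetical class boundaries
    bounds = [0] + edges + [100]      # assumed overall value range
    midpoints = [(lo + hi) / 2 for lo, hi in zip(bounds, bounds[1:])]

    population = [rng.gauss(50, 15) for _ in range(n)]
    labels = [discretize(v, edges) for v in population]

    # Small training sample: metric values AND classes are known here.
    n_train = int(n * train_share)
    params = fit_class_normals(population[:n_train], labels[:n_train])

    # Test data: only the ordinal class is observed.
    test_truth = population[n_train:]
    test_labels = labels[n_train:]

    # Point estimate = fitted class mean; fall back to the midpoint
    # if a class never appeared in the training sample.
    fitted = [params.get(c, (midpoints[c], 0.0))[0] for c in test_labels]
    baseline = [midpoints[c] for c in test_labels]
    return mse(test_truth, fitted), mse(test_truth, baseline)


if __name__ == "__main__":
    mse_fitted, mse_midpoint = run_demo()
    print(f"MSE with fitted class means: {mse_fitted:.2f}")
    print(f"MSE with class midpoints:    {mse_midpoint:.2f}")
```

In this toy setting the fitted class means track the true values more closely than the midpoints because the outer classes are skewed toward the center of the distribution, which a midpoint ignores. The fitted standard deviations could likewise be used for rough per-class intervals, under the same normality assumption.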