Is Anonymization Through Discretization Reliable? Modeling Latent Probability Distributions for Ordinal Data as a Solution to the Small Sample Size Problem
The growing interest in data privacy and anonymization presents challenges, as traditional methods such as ordinal discretization often result in information loss by coarsening metric data. Current research suggests that modeling the latent distributions of ordinal classes can reduce the effectivene...
| Main Authors: | Stefan Michael Stroka, Christian Heumann |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | MDPI AG, 2024-10-01 |
| Series: | Stats |
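| Subjects: | re-identification; modeling latent class distribution; ordinal class; Bayesian inference; uncertainty quantification; supervised learning regression enhancement |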
| Online Access: | https://www.mdpi.com/2571-905X/7/4/70 |
| _version_ | 1850037045967192064 |
|---|---|
| author | Stefan Michael Stroka; Christian Heumann |
| author_sort | Stefan Michael Stroka |
| collection | DOAJ |
| description | The growing interest in data privacy and anonymization presents challenges, as traditional methods such as ordinal discretization often result in information loss by coarsening metric data. Current research suggests that modeling the latent distributions of ordinal classes can reduce the effectiveness of anonymization and increase traceability. In fact, combining probability distributions with a small training sample can effectively infer true metric values from discrete information, depending on the model and data complexity. Our method uses metric values and ordinal classes to model latent normal distributions for each discrete class. This approach, applied with both linear and Bayesian linear regression, aims to enhance supervised learning models. Evaluated with synthetic datasets and real-world datasets from UCI and Kaggle, our method shows improved mean point estimation and narrower prediction intervals compared to the baseline. With 5–10% training data randomly split from each dataset population, it achieves an average 10% reduction in MSE and a ~5–10% increase in R² on out-of-sample test data overall. |
| format | Article |
| id | doaj-art-99ee90f3a6f14be78a801be1353ebec2 |
| institution | DOAJ |
| issn | 2571-905X |
| language | English |
| publishDate | 2024-10-01 |
| publisher | MDPI AG |
| record_format | Article |
| series | Stats |
| spelling | doaj-art-99ee90f3a6f14be78a801be1353ebec2; 2025-08-20T02:56:58Z; eng; MDPI AG; Stats; ISSN 2571-905X; 2024-10-01; vol. 7, no. 4, pp. 1189–1208; DOI 10.3390/stats7040070; Is Anonymization Through Discretization Reliable? Modeling Latent Probability Distributions for Ordinal Data as a Solution to the Small Sample Size Problem; Stefan Michael Stroka (Department of Statistics, Ludwig-Maximilians-University Munich, 80539 Munich, Germany); Christian Heumann (Department of Statistics, Ludwig-Maximilians-University Munich, 80539 Munich, Germany); https://www.mdpi.com/2571-905X/7/4/70; re-identification; modeling latent class distribution; ordinal class; Bayesian inference; uncertainty quantification; supervised learning regression enhancement |
| title | Is Anonymization Through Discretization Reliable? Modeling Latent Probability Distributions for Ordinal Data as a Solution to the Small Sample Size Problem |
| topic | re-identification; modeling latent class distribution; ordinal class; Bayesian inference; uncertainty quantification; supervised learning regression enhancement |
| url | https://www.mdpi.com/2571-905X/7/4/70 |
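
The description states that a small training sample with known metric values can be used to model a latent normal distribution for each ordinal class and then infer metric values from the discrete classes alone. The following is a minimal sketch of that general idea only, not the authors' implementation: it omits the linear and Bayesian linear regression steps, and the function names, bin edges, synthetic population, and global-mean comparison are all assumptions made for illustration.

```python
import random
import statistics
from collections import defaultdict


def discretize(value, edges):
    """Map a metric value to an ordinal class index using sorted bin edges."""
    return sum(value >= edge for edge in edges)


def fit_class_normals(values, labels):
    """Estimate (mean, std) of a normal distribution for each ordinal class."""
    grouped = defaultdict(list)
    for value, label in zip(values, labels):
        grouped[label].append(value)
    params = {}
    for label, members in grouped.items():
        mean = statistics.fmean(members)
        # A single observation gives no spread estimate; fall back to 0.0.
        std = statistics.stdev(members) if len(members) > 1 else 0.0
        params[label] = (mean, std)
    return params


def infer_metric(labels, params):
    """Point-estimate each hidden metric value by the mean of its class."""
    return [params[label][0] for label in labels]


def mse(truth, estimates):
    """Mean squared error between true and estimated values."""
    return statistics.fmean([(t - e) ** 2 for t, e in zip(truth, estimates)])


def demo(seed=0):
    """Compare class-mean estimates with a single global-mean estimate."""
    rng = random.Random(seed)
    # Hypothetical population and bin edges, chosen only for illustration.
    population = [rng.gauss(50.0, 15.0) for _ in range(2000)]
    edges = [35.0, 50.0, 65.0]
    labels = [discretize(value, edges) for value in population]

    # 10% of the records keep their metric values (the small training sample).
    n_train = len(population) // 10
    params = fit_class_normals(population[:n_train], labels[:n_train])

    # The remaining records are only known through their ordinal class.
    test_values = population[n_train:]
    test_labels = labels[n_train:]
    class_estimates = infer_metric(test_labels, params)

    global_mean = statistics.fmean(population[:n_train])
    global_estimates = [global_mean] * len(test_values)
    return mse(test_values, class_estimates), mse(test_values, global_estimates)


if __name__ == "__main__":
    class_mse, global_mse = demo()
    print(f"class-mean MSE: {class_mse:.2f}, global-mean MSE: {global_mse:.2f}")
```

In this toy setting the per-class means recover the hidden values more closely than a single global mean, which illustrates why discretization alone may not fully hide the underlying metric data. The comparison shown here is not the baseline used in the article.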