SynC2S: An Efficient Method for Synthesizing Tabular Data With a Learnable Pre-Processing

Bibliographic Details
Main Authors: Jiwoo Kim, Seri Park, Junsung Koh, Dongha Kim
Format: Article
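Language: English
Published: IEEE 2025-01-01
Series: IEEE Access
Subjects: Synthetic data generation; complex-to-simple mapping; hierarchically conditional importance weighted autoencoder
Online Access: https://ieeexplore.ieee.org/document/10704657/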
author Jiwoo Kim
Seri Park
Junsung Koh
Dongha Kim
collection DOAJ
description There has been a growing demand for access to large public datasets, whether to extract valuable insights or to enhance services. However, such access also carries risks, such as privacy breaches and unauthorized data exposure. Data synthesis has emerged as a popular technique for addressing privacy preservation and data usability simultaneously. Numerous deep-learning-based methods have recently been developed, but their effectiveness is still not clearly understood, and more efficient frameworks are still needed. In this study, we propose an efficient and theoretically principled method, based on a deep generative model, for generating high-quality synthetic tabular data. First, we introduce a novel technique called C2Smap, a learnable pre-processing method that automatically transforms continuous distributions into simpler forms that are easier to generate. We then develop a conditional generative model with a hierarchical structure, together with its corresponding learning framework, called HCIWAE, to capture imbalanced categorical distributions. Combining these two components, we call our method Synthetic data generation with C2Smap (SynC2S). Through comprehensive experimental analyses, we demonstrate the superiority and efficiency of SynC2S in generating synthetic data compared with other recent competitors. Furthermore, as a by-product, we claim that SynC2S could be a favorable option for over-sampling tasks: generating synthetic data for the minority class makes it possible to build high-performance prediction models.
format Article
id doaj-art-e386e59e6fd44aeb91f66fe03d7de3f5
institution Kabale University
issn 2169-3536
language English
publishDate 2025-01-01
publisher IEEE
record_format Article
series IEEE Access
spelling doaj-art-e386e59e6fd44aeb91f66fe03d7de3f5
2025-01-14T00:02:29Z
eng
IEEE
IEEE Access
2169-3536
2025-01-01
13
5575
5594
10.1109/ACCESS.2024.3472706
10704657
SynC2S: An Efficient Method for Synthesizing Tabular Data With a Learnable Pre-Processing
Jiwoo Kim [0]
Seri Park [1]
Junsung Koh [2]
Dongha Kim [3] https://orcid.org/0000-0001-7819-5619
[0] Department of Statistics, Sungshin Women’s University, Seoul, Republic of Korea
[1] Department of Statistics, Sungshin Women’s University, Seoul, Republic of Korea
[2] Kohwoon and Company Corporation, Bucheon-si, Gyeonggi-do, Republic of Korea
[3] Department of Statistics, Sungshin Women’s University, Seoul, Republic of Korea
https://ieeexplore.ieee.org/document/10704657/
Synthetic data generation
complex-to-simple mapping
hierarchically conditional importance weighted autoencoder
title SynC2S: An Efficient Method for Synthesizing Tabular Data With a Learnable Pre-Processing
topic Synthetic data generation
complex-to-simple mapping
hierarchically conditional importance weighted autoencoder
url https://ieeexplore.ieee.org/document/10704657/