Synthetic Financial Data: A Case Study Regarding Polish Limited Liability Companies Data

Aim: The aim of this article is to present and evaluate the concept of synthetic data. Synthetic data are completely new, artificially generated data that keep the statistical properties of real data. Because of this statistical similarity, they can be used in place of real data. This action allows data t...

Bibliographic Details
Main Author:	Aleksandra Szymura
Format:	Article
Language:	English
Published:	Wydawnictwo Uniwersytetu Ekonomicznego we Wrocławiu 2024-07-01
Series:	Ekonometria
Subjects:	synthetic data; generative models; financial data; ctgan; tvae
Online Access:	https://journals.ue.wroc.pl/eada/article/view/1215

collection	DOAJ
description	Aim: The aim of this article is to present and evaluate the concept of synthetic data. Synthetic data are completely new, artificially generated data that keep the statistical properties of real data. Because of this statistical similarity, they can be used in place of real data. This allows data to be shared externally while guaranteeing their privacy. Methodology: New datasets were generated based on financial information about Polish limited liability companies, which comes from the Orbis database and refers to 2020. Two generative models were used to create the synthetic data: CTGAN (based on the GAN architecture) and TVAE (based on autoencoders). Finally, the synthetic data were compared with the real data in terms of statistical properties (e.g. shape of distributions, correlations) and their applicability to machine learning models (PCA method). Results: The Overall Quality Score was higher for the data generated by TVAE, but a more detailed examination of the results showed that the data generated by CTGAN were of better quality in terms of keeping the statistical properties of the real data. In the comparison of the PCA results, TVAE was better than CTGAN. In addition, the TVAE method was less time-consuming than CTGAN. Implications and recommendations: Before publishing synthetic data externally, it is recommended to generate the data using several algorithms, evaluate their results, and select the best option. This helps the resulting dataset to be of the highest quality. In further research, it is proposed that other algorithms be tested (e.g. CopulaGAN or TableGAN), in an attempt to deal with some of the realistic data problems that were not covered in this analysis, such as missing values (the work was carried out with a complete dataset). Data generated in this study may be used to build financial indicators, which in turn could be used to construct company assessment models. Originality/value: Synthetic data help to deal with some data limitations, such as data privacy or scarcity. Due to their statistical similarity with real data, it is possible to use them in advanced machine learning methods instead of real datasets. Analysis of high-quality synthetic data allows conclusions similar to those from analysis of real data to be reached, while retaining privacy and without disclosing sensitive data to third parties.
id	doaj-art-df1f020ceeaf4fb5b69aaeb750de25e8
institution	DOAJ
issn	2449-9994
spelling	doaj-art-df1f020ceeaf4fb5b69aaeb750de25e8; 2025-08-20T03:08:48Z; eng; Wydawnictwo Uniwersytetu Ekonomicznego we Wrocławiu; Ekonometria; 2449-9994; 2024-07-01; 2821216; Synthetic Financial Data: A Case Study Regarding Polish Limited Liability Companies Data; Aleksandra Szymura; https://orcid.org/0000-0002-9009-3655; Wroclaw University of Economics and Business; https://journals.ue.wroc.pl/eada/article/view/1215; synthetic data; generative models; financial data; ctgan; tvae
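
The comparison step named in the description (shape of distributions and correlations of real versus synthetic data) can be illustrated with a minimal sketch. This is not the article's code and it does not reproduce its Overall Quality Score; it assumes two tables held as plain dictionaries of numeric columns, and the column names `total_assets` and `revenue`, the toy data, and the helper names are all hypothetical. Only the Python standard library is used.

```python
import math
import random


def ks_statistic(real, synth):
    """Two-sample Kolmogorov-Smirnov statistic: largest gap between the two empirical CDFs."""
    a, b = sorted(real), sorted(synth)
    i = j = 0
    d = 0.0
    while i < len(a) and j < len(b):
        x = min(a[i], b[j])
        # Advance both pointers past every value <= x, then compare the CDFs at x.
        while i < len(a) and a[i] <= x:
            i += 1
        while j < len(b) and b[j] <= x:
            j += 1
        d = max(d, abs(i / len(a) - j / len(b)))
    return d


def pearson(x, y):
    """Pearson correlation coefficient of two equally long numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)


def compare(real, synth):
    """Return per-column shape similarity (1 - KS) and per-pair correlation gaps.

    real, synth: dicts mapping the same column names to lists of floats.
    """
    cols = list(real)
    shape = {c: 1.0 - ks_statistic(real[c], synth[c]) for c in cols}
    corr_gap = {}
    for k, c1 in enumerate(cols):
        for c2 in cols[k + 1:]:
            corr_gap[(c1, c2)] = abs(
                pearson(real[c1], real[c2]) - pearson(synth[c1], synth[c2])
            )
    return shape, corr_gap


if __name__ == "__main__":
    rng = random.Random(0)

    def toy_table(n):
        # Hypothetical stand-in for company financials; not Orbis data.
        assets = [rng.lognormvariate(10, 1) for _ in range(n)]
        revenue = [0.8 * a + rng.gauss(0, 5000) for a in assets]
        return {"total_assets": assets, "revenue": revenue}

    real, synth = toy_table(500), toy_table(500)
    shape, corr_gap = compare(real, synth)
    print("shape similarity per column:", shape)
    print("correlation gap per column pair:", corr_gap)
```

A shape value close to 1 and a correlation gap close to 0 indicate that the synthetic table tracks the real one on that column or column pair. Running the same comparison on the output of several generators is one way to follow the record's recommendation to evaluate more than one algorithm before publishing.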


Similar Items