When Two are Better Than One: Synthesizing Heavily Unbalanced Data

Nowadays, data is king and if treated and used properly it promises to give organizations a competitive edge over rivals by enabling them to develop and design Intelligent Systems to improve their services. However, they need to fully comply with not only ethical but also regulatory obligations, whe...

Bibliographic Details
Main Authors:	Francisco Ferreira, Nuno Lourenco, Bruno Cabral, Joao Paulo Fernandes
Format:	Article
Language:	English
Published:	IEEE 2021-01-01
Series:	IEEE Access
Subjects:	Fraud detection; generative adversarial networks; privacy; machine learning; synthetic data generation; tabular data
Online Access:	https://ieeexplore.ieee.org/document/9606863/

_version_	1850106916198416384
author	Francisco Ferreira; Nuno Lourenco; Bruno Cabral; Joao Paulo Fernandes
author_facet	Francisco Ferreira; Nuno Lourenco; Bruno Cabral; Joao Paulo Fernandes
author_sort	Francisco Ferreira
collection	DOAJ
description	Nowadays, data is king and if treated and used properly it promises to give organizations a competitive edge over rivals by enabling them to develop and design Intelligent Systems to improve their services. However, they need to fully comply with not only ethical but also regulatory obligations, where, e.g., privacy (strictly) needs to be respected when using or sharing data, thus protecting both the interests of users and organizations. Fraud Detection systems are examples of such systems where Machine Learning algorithms leverage information to classify financial transactions as legitimate or illicit. The data used to create these solutions is usually highly structured and contains categorical and continuous features characterised by complex distributions. One of the main challenges of fraud detection is concerned with the scarcity of fraudulent instances which results in highly unbalanced datasets. Additionally, privacy is crucial, and it is usually forbidden, or not possible, to share the data of organizations and individuals for creating or improving models. In this paper we propose a framework for private data sharing based on synthetic data generation using Generative Adversarial Networks (GAN) that learns the specificities of financial transactions data and generates fictitious data that keeps the utility of the original datasets. Our proposal, called Duo-GAN, uses two GAN generators to handle the data imbalance problem, one generator for fraudulent instances and the other for legitimate instances. With this approach, we observed, at most, a 5% disparity in F1 scores between classifiers trained and tested with actual data and the ones trained with synthetic data and tested with actual data.
format	Article
id	doaj-art-baae542a3fa84f2aa6f952e2cef0f249
institution	OA Journals
issn	2169-3536
language	English
publishDate	2021-01-01
publisher	IEEE
record_format	Article
series	IEEE Access
spelling	doaj-art-baae542a3fa84f2aa6f952e2cef0f249 | 2025-08-20T02:38:42Z | eng | IEEE | IEEE Access | 2169-3536 | 2021-01-01 | 9 | 150459 | 150469 | 10.1109/ACCESS.2021.3126656 | 9606863 | When Two are Better Than One: Synthesizing Heavily Unbalanced Data | Francisco Ferreira [0] https://orcid.org/0000-0001-6060-4971 | Nuno Lourenco [1] https://orcid.org/0000-0002-2154-0642 | Bruno Cabral [2] https://orcid.org/0000-0001-9699-1133 | Joao Paulo Fernandes [3] https://orcid.org/0000-0002-1952-9460 | CISUC, DEI, University of Coimbra, Coimbra, Portugal | CISUC, DEI, University of Coimbra, Coimbra, Portugal | CISUC, DEI, University of Coimbra, Coimbra, Portugal | LIACC, DEI, Faculdade de Engenharia da Universidade do Porto, Porto, Portugal | https://ieeexplore.ieee.org/document/9606863/ | Fraud detection; generative adversarial networks; privacy; machine learning; synthetic data generation; tabular data
spellingShingle	Francisco Ferreira; Nuno Lourenco; Bruno Cabral; Joao Paulo Fernandes | When Two are Better Than One: Synthesizing Heavily Unbalanced Data | IEEE Access | Fraud detection; generative adversarial networks; privacy; machine learning; synthetic data generation; tabular data
title	When Two are Better Than One: Synthesizing Heavily Unbalanced Data
title_full	When Two are Better Than One: Synthesizing Heavily Unbalanced Data
title_fullStr	When Two are Better Than One: Synthesizing Heavily Unbalanced Data
title_full_unstemmed	When Two are Better Than One: Synthesizing Heavily Unbalanced Data
title_short	When Two are Better Than One: Synthesizing Heavily Unbalanced Data
title_sort	when two are better than one synthesizing heavily unbalanced data
topic	Fraud detection; generative adversarial networks; privacy; machine learning; synthetic data generation; tabular data
url	https://ieeexplore.ieee.org/document/9606863/
work_keys_str_mv	AT franciscoferreira whentwoarebetterthanonesynthesizingheavilyunbalanceddata AT nunolourenco whentwoarebetterthanonesynthesizingheavilyunbalanceddata AT brunocabral whentwoarebetterthanonesynthesizingheavilyunbalanceddata AT joaopaulofernandes whentwoarebetterthanonesynthesizingheavilyunbalanceddata


Similar Items