Graph Deep Active Learning Framework for Data Deduplication

With the advent of the big data era, a growing volume of duplicate data is expressed in different forms. To reduce redundant storage and improve data quality, data deduplication has become more important than ever. It is usually necessary to connect multi...

Bibliographic Details
Main Authors: Huan Cao, Shengdong Du, Jie Hu, Yan Yang, Shi-Jinn Horng, Tianrui Li
Format: Article
Language: English
Published: Tsinghua University Press 2024-09-01
Series: Big Data Mining and Analytics
Subjects: data deduplication; active learning; similarity calculation
Online Access: https://www.sciopen.com/article/10.26599/BDMA.2023.9020040
author Huan Cao
Shengdong Du
Jie Hu
Yan Yang
Shi-Jinn Horng
Tianrui Li
collection DOAJ
description With the advent of the big data era, a growing volume of duplicate data is expressed in different forms. To reduce redundant storage and improve data quality, data deduplication has become more important than ever. It is usually necessary to connect multiple data tables and identify different records that refer to the same entity, especially in multi-source data deduplication. Active learning trains the model by selecting the data items with the maximum information divergence, which reduces the amount of data to be annotated and offers unique advantages for annotating big data. However, most current active learning methods address only classical entity matching and are rarely applied to data deduplication tasks. To fill this research gap, we propose a novel graph deep active learning framework for data deduplication. The framework combines similarity algorithms with the bidirectional encoder representations from transformers (BERT) model to extract deep similarity features of multi-source data records. It is also the first to introduce a graph active learning strategy, which builds a clean graph to filter the data that needs to be labeled; this is used to delete duplicate data while retaining the most information. Experimental results on real-world datasets demonstrate that the proposed method outperforms state-of-the-art active learning models on data deduplication tasks.
format Article
id doaj-art-7418ce9816034b249a0f1e842d8d83cd
institution Kabale University
issn 2096-0654
language English
publishDate 2024-09-01
publisher Tsinghua University Press
record_format Article
series Big Data Mining and Analytics
spelling doaj-art-7418ce9816034b249a0f1e842d8d83cd
2025-02-03T10:19:58Z
eng
Tsinghua University Press
Big Data Mining and Analytics
2096-0654
2024-09-01
Volume 7, Issue 3, Pages 753-764
10.26599/BDMA.2023.9020040
Graph Deep Active Learning Framework for Data Deduplication
Huan Cao (0), Shengdong Du (1), Jie Hu (2), Yan Yang (3), Shi-Jinn Horng (4), Tianrui Li (5)
(0), (1), (2), (3), (5): School of Computing and Artificial Intelligence, Southwest Jiaotong University, and also with the Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, Chengdu 611756, China
(4): College of Information and Electric Engineering, Asia University, Chongsheng 41359, China
https://www.sciopen.com/article/10.26599/BDMA.2023.9020040
data deduplication
active learning
similarity calculation
title Graph Deep Active Learning Framework for Data Deduplication
topic data deduplication
active learning
similarity calculation
url https://www.sciopen.com/article/10.26599/BDMA.2023.9020040