Graph Deep Active Learning Framework for Data Deduplication
With the advent of the big data era, a growing volume of duplicate data is expressed in different forms. To reduce redundant data storage and improve data quality, data deduplication technology is more important than ever. It is usually necessary to connect multi...
Main Authors: | Huan Cao, Shengdong Du, Jie Hu, Yan Yang, Shi-Jinn Horng, Tianrui Li |
---|---|
Format: | Article |
Language: | English |
Published: | Tsinghua University Press, 2024-09-01 |
Series: | Big Data Mining and Analytics |
Subjects: | data deduplication, active learning, similarity calculation |
Online Access: | https://www.sciopen.com/article/10.26599/BDMA.2023.9020040 |
_version_ | 1832544411405254656 |
---|---|
author | Huan Cao Shengdong Du Jie Hu Yan Yang Shi-Jinn Horng Tianrui Li |
author_sort | Huan Cao |
collection | DOAJ |
description | With the advent of the big data era, a growing volume of duplicate data is expressed in different forms. To reduce redundant data storage and improve data quality, data deduplication technology is more important than ever. It is usually necessary to connect multiple data tables and identify different records pointing to the same entity, especially in multi-source data deduplication. Active learning trains the model by selecting the data items with the maximum information divergence, reducing the amount of data to be annotated, which gives it unique advantages for annotating big data. However, most current active learning methods address only classical entity matching and are rarely applied to data deduplication tasks. To fill this research gap, we propose a novel graph deep active learning framework for data deduplication. The framework combines similarity algorithms with the bidirectional encoder representations from transformers (BERT) model to extract deep similarity features of multi-source data records, and is the first to introduce a graph active learning strategy, which builds a clean graph to filter the data that need to be labeled; this is used to delete duplicate data while retaining the most information. Experimental results on real-world datasets demonstrate that the proposed method outperforms state-of-the-art active learning models on data deduplication tasks. |
format | Article |
id | doaj-art-7418ce9816034b249a0f1e842d8d83cd |
institution | Kabale University |
issn | 2096-0654 |
language | English |
publishDate | 2024-09-01 |
publisher | Tsinghua University Press |
record_format | Article |
series | Big Data Mining and Analytics |
spelling | doaj-art-7418ce9816034b249a0f1e842d8d83cd; 2025-02-03T10:19:58Z; eng; Tsinghua University Press; Big Data Mining and Analytics; 2096-0654; 2024-09-01; vol. 7, no. 3, pp. 753-764; 10.26599/BDMA.2023.9020040; Graph Deep Active Learning Framework for Data Deduplication; Huan Cao (0), Shengdong Du (1), Jie Hu (2), Yan Yang (3), Shi-Jinn Horng (4), Tianrui Li (5); (0, 1, 2, 3, 5) School of Computing and Artificial Intelligence, Southwest Jiaotong University, and also with the Engineering Research Center of Sustainable Urban Intelligent Transportation, Ministry of Education, Chengdu 611756, China; (4) College of Information and Electric Engineering, Asia University, Chongsheng 41359, China; https://www.sciopen.com/article/10.26599/BDMA.2023.9020040; data deduplication; active learning; similarity calculation |
title | Graph Deep Active Learning Framework for Data Deduplication |
topic | data deduplication active learning similarity calculation |
url | https://www.sciopen.com/article/10.26599/BDMA.2023.9020040 |