Detective Gadget: Generic Iterative Entity Resolution over Dirty Data

In the era of Big Data, entity resolution (ER), i.e., the process of identifying which records refer to the same entity in the real world, plays a critical role in data-integration tasks, especially in mission-critical applications where accuracy is mandatory, since we want to avoid integrating diff...

Bibliographic Details
Main Authors: Marcello Buoncristiano, Giansalvatore Mecca, Donatello Santoro, Enzo Veltri
Format: Article
Language: English
Published: MDPI AG 2024-11-01
Series: Data
Subjects: entity resolution; iterative; algorithms; design; performance
Online Access: https://www.mdpi.com/2306-5729/9/12/139
collection DOAJ
description In the era of Big Data, entity resolution (ER), i.e., the process of identifying which records refer to the same real-world entity, plays a critical role in data-integration tasks, especially in mission-critical applications where accuracy is mandatory, since we want to avoid both merging different entities and missing matches. However, existing approaches struggle with the challenges posed by rapidly changing data and by dirty data, which require iterative refinement over time. We present Detective Gadget, a novel system for iterative ER that seamlessly integrates data cleaning into the ER workflow. Detective Gadget employs an alias-based hashing mechanism for fast and scalable matching, check functions to detect and correct mismatches, and a human-in-the-loop framework to refine results through expert feedback. The system iteratively improves data quality and matching accuracy by leveraging evidence from both automated and manual decisions. Extensive experiments across diverse real-world scenarios demonstrate its effectiveness, achieving high accuracy and efficiency while adapting to evolving datasets.
format Article
id doaj-art-e507c58282e14169b23b973df9c73e0d
institution OA Journals
issn 2306-5729
language English
publishDate 2024-11-01
publisher MDPI AG
record_format Article
series Data
spelling doaj-art-e507c58282e14169b23b973df9c73e0d
2025-08-20T02:00:21Z
eng
MDPI AG
Data
2306-5729
2024-11-01
9
12
139
10.3390/data9120139
Detective Gadget: Generic Iterative Entity Resolution over Dirty Data
Marcello Buoncristiano (Svelto!—Big Data-Cleaning and Analytics, 85100 Potenza, Italy)
Giansalvatore Mecca (Dipartimento di Ingegneria, Università degli Studi della Basilicata, 85100 Potenza, Italy)
Donatello Santoro (Dipartimento di Ingegneria, Università degli Studi della Basilicata, 85100 Potenza, Italy)
Enzo Veltri (Dipartimento di Ingegneria, Università degli Studi della Basilicata, 85100 Potenza, Italy)
https://www.mdpi.com/2306-5729/9/12/139
entity resolution
iterative
algorithms
design
performance
title Detective Gadget: Generic Iterative Entity Resolution over Dirty Data
topic entity resolution
iterative
algorithms
design
performance
url https://www.mdpi.com/2306-5729/9/12/139