Research on a denoising model for entity-relation extraction using hierarchical contrastive learning with distant supervision

Bibliographic Details
Main Authors: Ayiguli Halike, Aishan Wumaier, Kahaerjiang Abiderexiti, Tuergen Yibulayin
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Scientific Reports
Online Access: https://doi.org/10.1038/s41598-025-04474-7
Description
Summary: Distant supervision is a technique that utilizes knowledge base information to automatically generate labels for text samples, enabling the large-scale creation of labeled data. However, this approach often encounters the issue of noisy labels in practice, which arises from inaccuracies in the alignment between the text and the knowledge base, leading to erroneous generated labels that adversely affect the model’s performance. In the task of relation extraction, such noise not only diminishes extraction accuracy but may also cause the model to favor the recognition of common relations while neglecting long-tail relations. To address these issues, this paper proposes a hierarchical contrastive learning framework, applied to the Uyghur language using pre-trained models, including the CINO minority-language model. This framework integrates both global structural information and local fine-grained interactions to reduce noise within sentences. Specifically, a three-layer learning architecture is designed, which incorporates interactions at different levels and employs a multi-head self-attention mechanism to generate denoised context-aware representations, referred to as multi-granular re-contextualization. Additionally, a dynamic gradient adversarial perturbation data augmentation strategy is introduced to provide pseudo-positive samples for contrastive learning, further enhancing the model’s ability to recognize rare relations. Experimental results demonstrate that the proposed framework significantly improves accuracy and robustness in the task of Uyghur relation extraction. This research offers new perspectives and methodologies for distantly supervised relation extraction.
ISSN: 2045-2322