Gold standard, multi-genre dataset for named entity recognition and linking

Bibliographic Details
Main Authors: Szymon Olewniczak, Julian Szymański
Format: Article
Language: English
Published: Nature Portfolio 2025-06-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-025-05274-4
Description
Summary: In our study, we introduce a new dataset designed for the evaluation of entity-linking systems. Entity Linking (EL) involves identifying specific segments in a text, so-called mentions, and linking them to relevant entries in an external Knowledge Base (KB). EL is a challenging task with numerous complexities, making it vital to have access to high-quality data for testing. Our dataset is unique in that it encompasses texts from various domains, in contrast to most current datasets, which focus on a single domain such as newspaper articles. Furthermore, we have annotated each identified text segment with its corresponding entity type, enhancing the dataset’s usefulness and reliability. The dataset employs Wikipedia as its Knowledge Base, which is the prevalent choice for general-domain entity linking systems. The dataset is available for download from https://doi.org/10.34808/f3q9-9k64.
ISSN: 2052-4463
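
The summary describes entity linking as marking mentions in a text, linking each one to a Knowledge Base entry, and labelling it with an entity type. A minimal sketch of that idea follows. It does not reflect the dataset's actual file format or the authors' evaluation protocol: the `Mention` record, its field names, the `PER`/`LOC` type labels, the example sentence, and the exact-match `score` function are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Mention:
    """Hypothetical annotation record; not the dataset's real schema."""
    start: int        # character offset where the mention begins
    end: int          # character offset one past the mention's last character
    kb_entry: str     # assumed KB identifier: a Wikipedia page title
    entity_type: str  # assumed type label, e.g. "PER" or "LOC"


def score(gold, predicted):
    """Return (precision, recall, f1), counting a prediction as correct
    only if both its span and its KB entry match a gold annotation."""
    gold_links = {(m.start, m.end, m.kb_entry) for m in gold}
    pred_links = {(m.start, m.end, m.kb_entry) for m in predicted}
    correct = len(gold_links & pred_links)
    precision = correct / len(pred_links) if pred_links else 0.0
    recall = correct / len(gold_links) if gold_links else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


# Invented example sentence with two annotated mentions.
text = "Marie Curie was born in Warsaw."
gold = [
    Mention(0, 11, "Marie_Curie", "PER"),
    Mention(24, 30, "Warsaw", "LOC"),
]
# A system output that finds both spans but links the second one wrongly.
predicted = [
    Mention(0, 11, "Marie_Curie", "PER"),
    Mention(24, 30, "Warsaw,_Indiana", "LOC"),
]

print(score(gold, predicted))
```

In this invented example one of the two predicted links is correct, so precision, recall and F1 all come out at 0.5. Other matching criteria, such as partial span overlap or also requiring the entity type to match, are equally plausible and would give different numbers.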