Gold standard, multi-genre dataset for named entity recognition and linking

Abstract: In our study, we introduce a new dataset designed for the evaluation of entity-linking systems. Entity Linking (EL) involves identifying specific segments in a text, so-called mentions, and linking them to relevant entries in an external Knowledge Base (KB). EL is a challenging task with nume...

Bibliographic Details
Main Authors: Szymon Olewniczak, Julian Szymański
Format: Article
Language: English
Published: Nature Portfolio 2025-06-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-025-05274-4
collection DOAJ
description Abstract: In our study, we introduce a new dataset designed for the evaluation of entity-linking systems. Entity Linking (EL) involves identifying specific segments in a text, so-called mentions, and linking them to relevant entries in an external Knowledge Base (KB). EL is a challenging task with numerous complexities, which makes access to high-quality test data vital. Unlike most current datasets, which focus on a single domain such as news articles, our dataset encompasses texts from various domains. Furthermore, we have annotated each identified text segment with its corresponding entity type, enhancing the dataset’s usefulness and reliability. The dataset uses Wikipedia as its Knowledge Base, the prevalent choice for general-domain entity-linking systems. The dataset is available for download from https://doi.org/10.34808/f3q9-9k64.
id doaj-art-70ef9ff26edf421a9fc3510dacb3874d
institution OA Journals
issn 2052-4463
spelling doaj-art-70ef9ff26edf421a9fc3510dacb3874d | 2025-08-20T02:06:23Z | eng | Nature Portfolio | Scientific Data | 2052-4463 | 2025-06-01 | 121122 | 10.1038/s41597-025-05274-4 | Gold standard, multi-genre dataset for named entity recognition and linking | Szymon Olewniczak; Julian Szymański | Department of Computer Architecture, Faculty of Electronics, Telecommunications and Informatics, Gdańsk University of Technology (both authors) | https://doi.org/10.1038/s41597-025-05274-4