Wiki-Quantities and Wiki-Measurements: Datasets of quantities and their measurement context from Wikipedia

Abstract To cope with the large number of publications, more and more researchers are automatically extracting data of interest using natural language processing methods based on supervised learning. Much data, especially in the natural and engineering sciences, is quantitative, but there is a lack...

Bibliographic Details
Main Authors: Jan Göpfert, Patrick Kuckertz, Jann M. Weinand, Detlef Stolten
Format: Article
Language: English
Published: Nature Portfolio 2025-07-01
Series: Scientific Data
Online Access: https://doi.org/10.1038/s41597-025-05499-3
author Jan Göpfert
Patrick Kuckertz
Jann M. Weinand
Detlef Stolten
collection DOAJ
description Abstract To cope with the large number of publications, more and more researchers are automatically extracting data of interest using natural language processing methods based on supervised learning. Much data, especially in the natural and engineering sciences, is quantitative, but there is a lack of datasets for identifying quantities and their context in text. To address this issue, we present two large datasets based on Wikipedia and Wikidata: Wiki-Quantities is a dataset consisting of over 1.2 million annotated quantities in the English-language Wikipedia. Wiki-Measurements is a dataset of 38 738 annotated quantities in the English-language Wikipedia along with their respective measured entity, property, and optional qualifiers. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements found 100% and 84-94% correct, respectively. The datasets can be used in pipeline approaches to measurement extraction, where quantities are first identified and then their measurement context. To allow reproduction of this work using newer or different versions of Wikipedia and Wikidata, we publish the code used to create the datasets along with the data.
format Article
id doaj-art-5e079eb7211b4dfd87b7d12fe252ff05
institution Kabale University
issn 2052-4463
language English
publishDate 2025-07-01
publisher Nature Portfolio
record_format Article
series Scientific Data
spelling doaj-art-5e079eb7211b4dfd87b7d12fe252ff05
2025-08-20T03:42:23Z
eng
Nature Portfolio
Scientific Data
2052-4463
2025-07-01
121116
10.1038/s41597-025-05499-3
Wiki-Quantities and Wiki-Measurements: Datasets of quantities and their measurement context from Wikipedia
Jan Göpfert, Patrick Kuckertz, Jann M. Weinand, Detlef Stolten
Forschungszentrum Jülich GmbH, Institute of Climate and Energy Systems, Jülich Systems Analysis (all four authors)
https://doi.org/10.1038/s41597-025-05499-3
title Wiki-Quantities and Wiki-Measurements: Datasets of quantities and their measurement context from Wikipedia
url https://doi.org/10.1038/s41597-025-05499-3