Wiki-Quantities and Wiki-Measurements: Datasets of quantities and their measurement context from Wikipedia
Abstract To cope with the large number of publications, more and more researchers are automatically extracting data of interest using natural language processing methods based on supervised learning. Much data, especially in the natural and engineering sciences, is quantitative, but there is a lack...
| Main Authors: | Jan Göpfert, Patrick Kuckertz, Jann M. Weinand, Detlef Stolten |
|---|---|
| Format: | Article |
| Language: | English |
| Published: | Nature Portfolio, 2025-07-01 |
| Series: | Scientific Data |
| Online Access: | https://doi.org/10.1038/s41597-025-05499-3 |
| _version_ | 1849388144771727360 |
|---|---|
| author | Jan Göpfert; Patrick Kuckertz; Jann M. Weinand; Detlef Stolten |
| author_facet | Jan Göpfert; Patrick Kuckertz; Jann M. Weinand; Detlef Stolten |
| author_sort | Jan Göpfert |
| collection | DOAJ |
| description | Abstract To cope with the large number of publications, more and more researchers are automatically extracting data of interest using natural language processing methods based on supervised learning. Much data, especially in the natural and engineering sciences, is quantitative, but there is a lack of datasets for identifying quantities and their context in text. To address this issue, we present two large datasets based on Wikipedia and Wikidata: Wiki-Quantities is a dataset consisting of over 1.2 million annotated quantities in the English-language Wikipedia. Wiki-Measurements is a dataset of 38 738 annotated quantities in the English-language Wikipedia along with their respective measured entity, property, and optional qualifiers. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements found 100% and 84-94% correct, respectively. The datasets can be used in pipeline approaches to measurement extraction, where quantities are first identified and then their measurement context. To allow reproduction of this work using newer or different versions of Wikipedia and Wikidata, we publish the code used to create the datasets along with the data. |
| format | Article |
| id | doaj-art-5e079eb7211b4dfd87b7d12fe252ff05 |
| institution | Kabale University |
| issn | 2052-4463 |
| language | English |
| publishDate | 2025-07-01 |
| publisher | Nature Portfolio |
| record_format | Article |
| series | Scientific Data |
| spelling | doaj-art-5e079eb7211b4dfd87b7d12fe252ff05; 2025-08-20T03:42:23Z; eng; Nature Portfolio; Scientific Data; 2052-4463; 2025-07-01; 121116; 10.1038/s41597-025-05499-3; Wiki-Quantities and Wiki-Measurements: Datasets of quantities and their measurement context from Wikipedia; Jan Göpfert 0; Patrick Kuckertz 1; Jann M. Weinand 2; Detlef Stolten 3; Forschungszentrum Jülich GmbH, Institute of Climate and Energy Systems, Jülich Systems Analysis; Forschungszentrum Jülich GmbH, Institute of Climate and Energy Systems, Jülich Systems Analysis; Forschungszentrum Jülich GmbH, Institute of Climate and Energy Systems, Jülich Systems Analysis; Forschungszentrum Jülich GmbH, Institute of Climate and Energy Systems, Jülich Systems Analysis; Abstract To cope with the large number of publications, more and more researchers are automatically extracting data of interest using natural language processing methods based on supervised learning. Much data, especially in the natural and engineering sciences, is quantitative, but there is a lack of datasets for identifying quantities and their context in text. To address this issue, we present two large datasets based on Wikipedia and Wikidata: Wiki-Quantities is a dataset consisting of over 1.2 million annotated quantities in the English-language Wikipedia. Wiki-Measurements is a dataset of 38 738 annotated quantities in the English-language Wikipedia along with their respective measured entity, property, and optional qualifiers. Manual validation of 100 samples each of Wiki-Quantities and Wiki-Measurements found 100% and 84-94% correct, respectively. The datasets can be used in pipeline approaches to measurement extraction, where quantities are first identified and then their measurement context. To allow reproduction of this work using newer or different versions of Wikipedia and Wikidata, we publish the code used to create the datasets along with the data.; https://doi.org/10.1038/s41597-025-05499-3 |
| spellingShingle | Jan Göpfert Patrick Kuckertz Jann M. Weinand Detlef Stolten Wiki-Quantities and Wiki-Measurements: Datasets of quantities and their measurement context from Wikipedia Scientific Data |
| title | Wiki-Quantities and Wiki-Measurements: Datasets of quantities and their measurement context from Wikipedia |
| title_full | Wiki-Quantities and Wiki-Measurements: Datasets of quantities and their measurement context from Wikipedia |
| title_fullStr | Wiki-Quantities and Wiki-Measurements: Datasets of quantities and their measurement context from Wikipedia |
| title_full_unstemmed | Wiki-Quantities and Wiki-Measurements: Datasets of quantities and their measurement context from Wikipedia |
| title_short | Wiki-Quantities and Wiki-Measurements: Datasets of quantities and their measurement context from Wikipedia |
| title_sort | wiki quantities and wiki measurements datasets of quantities and their measurement context from wikipedia |
| url | https://doi.org/10.1038/s41597-025-05499-3 |
| work_keys_str_mv | AT jangopfert wikiquantitiesandwikimeasurementsdatasetsofquantitiesandtheirmeasurementcontextfromwikipedia AT patrickkuckertz wikiquantitiesandwikimeasurementsdatasetsofquantitiesandtheirmeasurementcontextfromwikipedia AT jannmweinand wikiquantitiesandwikimeasurementsdatasetsofquantitiesandtheirmeasurementcontextfromwikipedia AT detlefstolten wikiquantitiesandwikimeasurementsdatasetsofquantitiesandtheirmeasurementcontextfromwikipedia |