CHTopo: A Multi-Source Large-Scale Chinese Toponym Annotation Corpus

Bibliographic Details
Main Authors: Peng Ye, Yujin Jiang, Yadi Wang
Format: Article
Language: English
Published: MDPI AG 2025-07-01
Series: Information
Subjects: Chinese text; toponym; annotated corpus; toponym recognition
Online Access: https://www.mdpi.com/2078-2489/16/7/610
author Peng Ye
Yujin Jiang
Yadi Wang
collection DOAJ
description Toponyms are fundamental geographical resources characterized by their spatial attributes, which distinguish them from general nouns. While natural language provides rich toponymic data beyond traditional surveying methods, its qualitative ambiguity and inherent uncertainty challenge systematic extraction. Traditional toponym recognition methods based on part-of-speech tagging focus only on the surface-level features of words and fail to handle complex scenarios such as alias nesting, metonymy ambiguity, and mixed punctuation. This leads to the loss of toponym semantic integrity and deviations in geographic entity recognition. This study proposes a set of Chinese toponym annotation specifications that integrate spatial semantics. Using XML markup, the specifications combine the spatial location characteristics of toponyms with linguistic features and define fine-grained annotation rules to address the limitations of traditional methods in semantic integrity and geographic entity recognition. On this basis, by integrating multi-source corpora from the Encyclopedia of China: Chinese Geography and People’s Daily, a large-scale Chinese toponym annotation corpus (CHTopo) covering five major categories of toponyms has been constructed. The annotated corpus was evaluated on a toponym recognition task, and the study explores, from the perspectives of applicability and practicality, how to construct a large-scale, diversified, high-coverage Chinese toponym annotated corpus. CHTopo can provide foundational support for geographic information extraction, spatial knowledge graphs, and geoparsing research, bridging linguistic and geospatial intelligence.
format Article
id doaj-art-4d309fc0f29f4731b3918df69084088f
institution DOAJ
issn 2078-2489
language English
publishDate 2025-07-01
publisher MDPI AG
record_format Article
series Information
spelling doaj-art-4d309fc0f29f4731b3918df69084088f
2025-08-20T03:08:06Z
eng
MDPI AG
Information
2078-2489
2025-07-01
Volume 16, Issue 7, Article 610
DOI: 10.3390/info16070610
CHTopo: A Multi-Source Large-Scale Chinese Toponym Annotation Corpus
Peng Ye (Urban Planning and Development Institute, Yangzhou University, Yangzhou 225127, China)
Yujin Jiang (Zhejiang Academy of Culture and Tourism Development, Hangzhou 311231, China)
Yadi Wang (School of Management, Henan University of Urban Construction, Pingdingshan 467041, China)
https://www.mdpi.com/2078-2489/16/7/610
Chinese text
toponym
annotated corpus
toponym recognition
title CHTopo: A Multi-Source Large-Scale Chinese Toponym Annotation Corpus
topic Chinese text
toponym
annotated corpus
toponym recognition
url https://www.mdpi.com/2078-2489/16/7/610
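
The description above mentions XML-based toponym annotation but does not give the tag set. As a rough illustration only, the following Python sketch shows how inline XML annotation of this general kind could be read back into plain text plus character-offset spans. The `<s>` and `<toponym>` element names, the `type` attribute, and the `city` value are hypothetical placeholders chosen for this example; they are not taken from the CHTopo specification, and the sentence is invented.

```python
import xml.etree.ElementTree as ET

# Hypothetical annotated sentence. The <s> and <toponym> tags and the "type"
# attribute are assumptions for illustration, not the CHTopo tag set.
ANNOTATED = (
    '<s>昨天我从<toponym type="city">扬州</toponym>'
    '坐火车去<toponym type="city">杭州</toponym>。</s>'
)


def extract_toponyms(xml_sentence):
    """Return (plain_text, spans); each span is (start, end, text, type).

    Only top-level <toponym> children of the sentence are reported. A toponym
    nested inside another one is folded into the outer span's text.
    """
    root = ET.fromstring(xml_sentence)
    parts = []   # text fragments in document order
    spans = []   # character-offset spans of annotated toponyms
    offset = 0   # running length of the plain text built so far

    def add(text):
        nonlocal offset
        if text:
            parts.append(text)
            offset += len(text)

    add(root.text)                      # text before the first child element
    for child in root:
        start = offset
        surface = "".join(child.itertext())
        add(surface)                    # text covered by the child element
        if child.tag == "toponym":
            spans.append((start, offset, surface, child.get("type")))
        add(child.tail)                 # text between this child and the next
    return "".join(parts), spans


plain_text, toponym_spans = extract_toponyms(ANNOTATED)
for start, end, surface, kind in toponym_spans:
    print(start, end, surface, kind)
```

Offsets are counted in characters rather than bytes, which is the usual choice for Chinese text where no whitespace tokenization is available.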