A large-scale dataset for Chinese historical document recognition and analysis

Abstract The development of Chinese civilization has produced a vast collection of historical documents. Recognizing and analyzing these documents holds significant value for research on ancient culture. Recently, researchers have tried to utilize deep-learning techniques to automate recognition...

Bibliographic Details
Main Authors:	Yongxin Shi, Dezhi Peng, Yuyi Zhang, Jiahuan Cao, Lianwen Jin
Format:	Article
Language:	English
Published:	Nature Portfolio 2025-01-01
Series:	Scientific Data
Online Access:	https://doi.org/10.1038/s41597-025-04495-x

author	Yongxin Shi, Dezhi Peng, Yuyi Zhang, Jiahuan Cao, Lianwen Jin
collection	DOAJ
description	Abstract The development of Chinese civilization has produced a vast collection of historical documents. Recognizing and analyzing these documents holds significant value for research on ancient culture. Recently, researchers have tried to utilize deep-learning techniques to automate recognition and analysis. However, existing Chinese historical document datasets, which deep-learning models rely on heavily, suffer from limited data scale, insufficient character categories, and a lack of book-level annotation. To fill this gap, we introduce HisDoc1B, a large-scale dataset for Chinese historical document recognition and analysis. HisDoc1B comprises 40,281 books, over 3 million document images, and over 1 billion characters across 30,615 character categories. To the best of our knowledge, HisDoc1B is the largest dataset in the field, surpassing existing datasets by more than 200 times in scale. Additionally, it is the only dataset with book-level annotations and punctuation annotations. Furthermore, extensive experiments demonstrate the high quality and practical utility of the proposed HisDoc1B. We believe that HisDoc1B could provide valuable resources to boost the advancement of research in this domain.
format	Article
id	doaj-art-17ff5ca0cbd145e587be8982110ae7be
institution	Kabale University
issn	2052-4463
language	English
publishDate	2025-01-01
publisher	Nature Portfolio
record_format	Article
series	Scientific Data
spelling	doaj-art-17ff5ca0cbd145e587be8982110ae7be; 2025-02-02T12:08:24Z; eng; Nature Portfolio; Scientific Data; ISSN 2052-4463; 2025-01-01; vol. 12, no. 1, pp. 1-10; 10.1038/s41597-025-04495-x; A large-scale dataset for Chinese historical document recognition and analysis; Yongxin Shi, Dezhi Peng, Yuyi Zhang, Jiahuan Cao, Lianwen Jin (all: School of Electronic and Information Engineering, South China University of Technology); https://doi.org/10.1038/s41597-025-04495-x
title	A large-scale dataset for Chinese historical document recognition and analysis
url	https://doi.org/10.1038/s41597-025-04495-x


Similar Items